How to do simple web scraping

When we are working on performance, sometimes we just need to make a few tweaks to an HTML page, such as changing the order in which scripts load, to test how that affects site speed. That is just one use case; there can be many other reasons. A clean and easy way to do this is to scrape the site and implement the changes we need to test on the scraped copy.

Here is the general outline:

  • Use the npm package website-scraper.
    • npm install website-scraper
    • Make sure you update your package.json to enable ES modules, since current versions of website-scraper are published as ES modules only (a minimal package.json is sketched after this list):
    • "type": "module"
    • Create a source file (e.g. main.js) that configures the scraping as follows:
    • // Use import, not require(): "type": "module" makes this an ES module.
      import scrape from 'website-scraper';

      const options = {
          // Pages to download; entries can be plain URLs or
          // objects that also choose the output filename.
          urls: [
              'http://www.site.com/',
              {
                  url: 'http://www.site.com/about',
                  filename: 'about.html'
              },
              {
                  url: 'http://www.site.com/product-123',
                  filename: 'product-123.html'
              }
          ],
          // Directory the scraped files are written to.
          directory: './site',
      };

      (async () => {
          await scrape(options);
      })();
    • Execute the file using Node.js
    • node main.js
  • Host the scraped site. For instance, you can use http-server (see the example after this list).
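As referenced above, here is a minimal package.json that enables ES modules; the package name and the version range for website-scraper are placeholders rather than requirements:

    {
        "name": "scrape-test",
        "type": "module",
        "dependencies": {
            "website-scraper": "^5.0.0"
        }
    }

And to serve the scraped output locally, one option is running http-server through npx, assuming the directory option above was left as ./site:

    npx http-server ./site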

That's it!

This example only scrapes the pages that are explicitly configured, but you can also make the scraper follow the hyperlinks it finds in the downloaded HTML pages, sweeping the site's full content.
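Here is a minimal sketch of that using the recursive and maxRecursiveDepth options from website-scraper; the depth of 3 and the urlFilter below are example choices, not requirements:

    import scrape from 'website-scraper';

    const options = {
        urls: ['http://www.site.com/'],
        // A fresh directory; the scraper expects it not to exist yet.
        directory: './site-full',
        // Follow hyperlinks found in the downloaded HTML pages.
        recursive: true,
        // Cap how deep the crawl goes so it doesn't download forever.
        maxRecursiveDepth: 3,
        // Only follow links that stay on the site under test.
        urlFilter: (url) => url.startsWith('http://www.site.com'),
    };

    (async () => {
        await scrape(options);
    })();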

This approach removes a lot of overhead because it doesn't care about the backend technology behind the site: it simply produces static content that you can host anywhere. That makes front-end performance testing much easier, since most of the web server's internal latency is removed from the measurements.
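To connect this back to the script-ordering example from the introduction, here is a hypothetical post-processing sketch. It assumes the scraped entry page ended up at ./site/index.html, and analytics.js is a placeholder for whatever script you want to experiment with:

    import { readFile, writeFile } from 'node:fs/promises';

    const page = './site/index.html';
    let html = await readFile(page, 'utf8');

    // Example tweak: let a blocking script load with defer instead.
    // 'analytics.js' is a placeholder name, not a real asset.
    html = html.replace(
        '<script src="analytics.js">',
        '<script defer src="analytics.js">'
    );

    await writeFile(page, html);

Re-run http-server afterwards and compare the page's load timings before and after the tweak.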