Build a simple web scraper with node.js

Recently I released my first personal iOS app into the wild. The app is called Unit Guide for Starcraft 2 and it provides Starcraft 2 players with up-to-date and accurate information for every unit in the game. Instead of manually creating a huge JSON file I wrote a web scraper in node.js that allows me to quickly extract all the data I need and output it in a JSON format. In this post I will explain how you can build something similar using techniques that are familiar to most web developers.

Step 1: preparing

Before you get started you’re going to want to install some dependencies. The ones I have used are: request, cheerio and promise. Installing them will work like this:
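
```
npm install request cheerio promise
```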

If you don’t have npm installed yet, follow the installation instructions on the node.js website to install node and npm.

Once you have all the dependencies, you’re going to need a webpage that you will scrape. I picked the Starcraft 2 units overview page as a starting point. You can pick any page you want, as long as it contains some data that you want to extract into a JSON file.

Step 2: loading the webpage

In order to start scraping the page, we’re going to need to load it up. We’ll be using request for this. Note that request will simply pull down the html; for my use case that was enough, but if you need the webpage to execute javascript in order to get the content you’re after, you might want to have a look at phantomjs. It’s a headless browser that allows javascript execution. I won’t go into it here as I didn’t need it for my project.

Downloading the html using request is pretty straightforward. Here’s how you can load a webpage:
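
(The URL below is just a placeholder for whatever page you want to scrape; in my case it was the Starcraft 2 units overview page.)

```js
var request = require('request');

// placeholder URL, use the page you actually want to scrape
var url = 'http://example.com/starcraft-2-units';

request(url, function(error, response, html) {
  if (error) {
    console.error('Could not load the page:', error);
    return;
  }

  // html now contains the raw html of the page as a string
  console.log(html);
});
```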

Getting the html was pretty easy, right? Now that we have the html we can use cheerio to convert the html string into a DOM-like object that we can query with css style selectors. All we have to do is include cheerio in our script and use it like this:
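
```js
var cheerio = require('cheerio');

// html is the string we downloaded with request in the previous snippet
var $ = cheerio.load(html);
```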

That’s it. We now have an object that we can query for data pretty easily.

Step 3: finding and extracting some content

Now that we have the entire webpage loaded up and we can query it, it’s time to look for content. Or in my case, I was looking for references to pages that contain the actual content I wanted to extract. The easiest way to find out what you should query the DOM for is to use the “inspect element” feature of your browser. It will give you an overview of all of the html elements on the page and where they are in the page’s hierarchy. Here’s part of the hierarchy I was interested in:

[Screenshot: the page’s DOM hierarchy, showing the table-lotv element and its unit-datatable children]

You can see an element that has the class table-lotv in the hierarchy. This element has three children with the class unit-datatable. The contents of this unit-datatable are of interest to me because somewhere in there I can find the names of the units I want to extract. To access these data tables and extract the relevant names you could use a query selector like this:
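
(The selectors below reflect how the page was structured when I wrote this, so treat them as an example rather than a guarantee.)

```js
$('.table-lotv .unit-datatable').each(function(index, dataTable) {
  // the race name (Terran, Protoss or Zerg) lives in a span inside .title-bar
  var raceName = $(dataTable).find('.title-bar span').text().trim();

  console.log(raceName);
});
```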

In the above snippet $('.table-lotv .unit-datatable') selects all of the data tables. When I loop over these I have access to the individual dataTable objects. Inside of these objects I found the race name (Terran, Protoss or Zerg), which is contained in a span element inside an element with the class title-bar. Extracting the name isn’t enough for my use case though. I also want to scrape each unit’s page, and after doing that I want to write all of the data to a JSON file at once. To do this I used promises. They are a great fit because I can easily create an array of promise objects and wait for all of them to be fulfilled. Let’s see how that’s done, shall we?

Step 4: build your list of promises

While we’re looping over the dataTable objects we can create some promises that will need to be fulfilled before we output the big JSON file we’re aiming for. Let’s look at some code:
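
(The output filename and the exact shape of the scraped unit data below are just examples; scrapeUnits is covered in the next step.)

```js
var request = require('request');
var cheerio = require('cheerio');
var Promise = require('promise');
var fs = require('fs');

var url = 'http://example.com/starcraft-2-units'; // placeholder URL

request(url, function(error, response, html) {
  if (error) {
    console.error('Could not load the page:', error);
    return;
  }

  var $ = cheerio.load(html);
  var promises = [];

  $('.table-lotv .unit-datatable').each(function(index, dataTable) {
    var raceName = $(dataTable).find('.title-bar span').text().trim();

    // scrapeUnits returns a promise that is fulfilled with the unit data
    promises.push(scrapeUnits($, dataTable, raceName));
  });

  Promise.all(promises).then(function(promiseResults) {
    var data = {};

    promiseResults.forEach(function(result) {
      data[result.race] = result.units;
    });

    fs.writeFile('units.json', JSON.stringify(data, null, 2), function(err) {
      if (err) {
        return console.error('Could not write the JSON file:', err);
      }

      console.log('Done, units.json was written to disk.');
    });
  });
});
```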

Okay, so in this snippet I added Promise to the requirements. Inside of the request callback I created an empty array of promises. While looping over the data tables I insert a new promise which is returned by the scrapeUnits function (I’ll get to that function in the next snippet). After looping through all of the data tables I use the Promise.all function to wait until all promises in my promises array are fulfilled. When they are fulfilled I use the results of these promises to populate a data object (which is our JSON data). The function we provide to the then handler for Promise.all receives one argument. This argument is an array of results for the promises we put in the promises array. If the promises array contains three elements, then so will promiseResults. Finally I write the data to disk using fs, which is also added in the requirements section (fs is part of node.js so you don’t have to install it through npm).

Step 5: nesting promises is cool

In the previous snippet I showed you this line of code:
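
```js
promises.push(scrapeUnits($, dataTable, raceName));
```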

The function scrapeUnits is a function which returns a promise. Let’s have a look at how this works, shall we?
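
Below is a simplified sketch of it; the .unit-name selector is an assumption and the real version also scrapes every unit’s own page for the details I needed.

```js
function scrapeUnits($, dataTable, raceName) {
  return new Promise(function(fulfil, reject) {
    var units = [];

    // the .unit-name selector is an assumption, use whatever the page provides
    $(dataTable).find('.unit-name').each(function(index, element) {
      units.push($(element).text().trim());
    });

    if (units.length === 0) {
      // something went wrong, reject the promise with an error
      reject(new Error('No units found for ' + raceName));
      return;
    }

    // the happy path: fulfil the promise with the scraped data
    fulfil({ race: raceName, units: units });
  });
}
```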

This function is pretty straightforward. It returns a new Promise object. A Promise object takes one function as a parameter, and that function receives two arguments: fulfil and reject. Both arguments are functions, and we call fulfil when our operation was successful or reject when we encountered an error. Once we call fulfil, the Promise is “done”. When we use Promise.all, the then handler will only get called once all promises passed to all have been fulfilled.

Step 6: putting it all together

The above script is a stripped version of the code I wrote to scrape all of the unit information I needed. What you should take away from all this is that it’s not very complex to build a scraper in node.js, especially if you’re using promises. At first promises might seem a bit weird, but if you get used to them you’ll realise that they are the perfect way to write maintainable and understandable asynchronous code. Promise.all in particular is a very fitting tool for what we’re trying to do when we scrape multiple webpages that should be merged into a single JSON file. The nice thing about node.js is that it’s javascript, so we can use a lot of the technology we also use in a browser, such as the css / jQuery style selectors that cheerio makes available to us.

Before you scrape a webpage, please remember that not every website owner appreciates it if you scrape their page to use their content, so make sure to only scrape what you need, when you need it. Especially if you start hitting somebody’s website with hundreds of requests you should be asking yourself if scraping this site is the correct thing to do.

If you have questions about this article, or would like to learn more about how I used the above techniques, you can let me know on Twitter.

Service workers are awesome

In the war between native and web apps there are a few aspects that make a native app superior to a web app. Among these are features like push notifications and offline caching. A native app, once installed, is capable of providing the user with a cache of older content (possibly updated in the background) while it’s fetching new, fresh content. This is a great way to avoid loading times for content and it’s something that browser vendors tried to solve with cache manifests and AppCache. Everybody who has tried to implement offline caching for their webpages will know that AppCache manifest files are a nightmare to maintain and that they’re pretty mysterious about how and when they store things. And so the gap between native and web remained unchanged.

In comes the service worker

The service worker aims to solve all of our issues with this native vs. web gap. The service worker will allow us to have a very granularly controlled cache, which is great. It will also allow us to send push notifications and receive background updates, and at the end of this talk Jake Archibald mentions that the Chrome team is even working on providing things like geofencing through the service worker API. This leads me to think that the service worker just might become the glue between the browser and the native platform that we need to close the gap once and for all.

If you watch the talk by Jake Archibald you’ll see that the service worker can help a great deal with speeding up page loads. You’ll be able to serve cached content to your users first and then add new content later on. You’re able to control the caching of images that aren’t even on your own servers. And more importantly, this method of caching is superior to browser caching because it will allow for true offline access and you can control the cache yourself. This means that the browser won’t just delete your cached data whenever it feels the need to do so.

How the service worker works

When you want to use a service worker you have to install it first. You can do this by calling the register function on navigator.serviceWorker. This will attempt to install the service worker for your page, and it’s usually the moment where you’ll want to cache some static assets. If this succeeds the service worker is installed. If it fails the service worker will attempt another install the next time the page is loaded; your page won’t be messed up if the installation fails.

Once the service worker is installed you can tap into network requests and respond with cached resources before the browser goes to the network. For example, the browser wants to request /static/style.css. The service worker will be notified through the fetch event and you can either respond with a cached resource or allow the browser to go out and fetch the resource.

HTTPS only!!

Because the service worker is such a powerful API it will only be available over HTTPS when you use it in production. When you’re on localhost HTTP will do, but otherwise you are required to use HTTPS. This is to prevent man-in-the-middle attacks. Also, when you’re developing locally you can’t use the file:// protocol; you will have to set up a local webserver. If you’re struggling with that, I wrote this post that illustrates three ways to quickly set up an HTTP server on your local machine. When you want to publish a demo you can use github pages; these are served over HTTPS by default so service workers will work there.

A basic service worker example

Browser support

Before I start with the example I want to mention that currently Chrome is the only browser that supports service workers. I believe Firefox is working hard on an implementation as well, and the other vendors are vague about supporting the service worker for now. This page has a good overview of how far along service worker support is.

The example

The best way to illustrate the powers of the service worker is probably to set up a quick demo. We’re going to create a page that has ten pretty huge pictures on it. These pictures will be loaded from several sources because I just typed ‘space’ into Google and picked a bunch of images there that I wanted to include on a webpage.

When I load this page without a service worker all the images will be fetched from the server, which can be pretty slow considering that we’re using giant space images. Let’s speed things up. First create an app.js file and include that in your page html right before the body tag closes. In that file you’ll need the following script:
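
(The sw.js filename is simply what I’ll call the service worker file in the next step.)

```js
// app.js
if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('sw.js').then(function(registration) {
    console.log('success');
  }, function(error) {
    console.log('failure', error);
  });
}
```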

This code snippet registers a service worker for our website. The register function returns a promise and when it gets resolved we just log the word ‘success’ for now. On error we’ll log ‘failure’. Now let’s set up the service worker.
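
In sw.js, something like this will do; the list of image URLs is an example and, depending on your Chrome version, you may need the cache polyfill mentioned in the debugging section below.

```js
// sw.js
var CACHE_NAME = 'SPACE_CACHE';

// example paths, cache the space images you actually put on the page
var urlsToCache = [
  'images/space-01.jpg',
  'images/space-02.jpg',
  'images/space-03.jpg'
];

self.addEventListener('install', function(event) {
  // waitUntil ties the installation to this caching operation:
  // if caching fails, the install fails
  event.waitUntil(
    caches.open(CACHE_NAME).then(function(cache) {
      return cache.addAll(urlsToCache);
    })
  );
});
```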

The code above creates a new service worker that adds a list of files to the “SPACE_CACHE”. The install event handler will wait for this operation to complete before it returns a success status, so if caching fails the installation will fail as well.

Now let’s write the fetch handler so we can respond with our freshly cached resources.
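
A minimal version looks like this:

```js
self.addEventListener('fetch', function(event) {
  event.respondWith(
    caches.match(event.request).then(function(cachedResponse) {
      // respond with the cached resource if we have one,
      // otherwise let the fetch API go out to the network
      return cachedResponse || fetch(event.request);
    })
  );
});
```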

This handler will take a request and match it against the SPACE_CACHE. When it finds a valid response, it will respond with it. Otherwise we use the fetch API that is available in service workers to load the request and we respond with that. This example is pretty straightforward and probably a lot simpler than what you might use in the real world.

Debugging

Debugging service workers is far from ideal, but it’s doable. In Chrome you can load chrome://serviceworker-internals/ or chrome://inspect/#service-workers to gain some insight into what is going on with your service workers. However, the Chrome team can still improve a lot when it comes to debugging service workers. When a worker fails to install properly, because you’re not using the cache polyfill for instance, it will report a successful installation and then get terminated without any error messages. This is very confusing and caused me quite a headache when I was first trying service workers.

Moving further with service workers

If you think service workers are interesting I suggest that you check out some examples and posts online. Jake Archibald wrote the offline cookbook (http://jakearchibald.com/2014/offline-cookbook/). There’s a lot of information on service workers in there. You can also check out his simple-serviceworker-tutorial (https://github.com/jakearchibald/simple-serviceworker-tutorial) on Github; I learned a lot from that.

In the near future the Chrome team will be adding things like push notifications and geofencing to service workers, so I think it’s worth the effort to have a look at them right now. That will really put you ahead of the curve when service workers hit the mainstream.

If you feel like I made some terrible mistakes in my overview of service workers or if you have something to tell me about them, please go ahead and hit me up on Twitter.

The source code for this blog post can be found on Github.

Death by papercut (why small optimizations matter)

It’s not easy to write good code. It’s also not easy to optimize code to be as fast as possible. Oftentimes I have found myself refactoring a piece of code multiple times because I could make it easier to read or perform faster. Sometimes I’ve achieved both. But when a project grew larger and larger, things would still feel a little slow after a while. For instance, changing something from doing four API calls to six wouldn’t matter that much, right? I mean, each call only takes about 10ms and everything is very optimized!

This is death by papercut

In the example I mentioned above, doing two extra API calls at 10ms each isn’t a very big deal in itself. What is a big deal though, is the fact that we’re doing four other calls as well. So that means we went from 40ms in API calls to 60ms; that’s a 50% increase. And then there’s also the overhead of extracting the data you want to send to the API from the DOM, and parsing the API’s response on the client takes extra time as well. Slowly all these milliseconds add up. On their own every part of the application is pretty fast, but when put together it becomes slower and slower.

I’m not sure where I heard this first but I think it’s a great analogy. All these not-so-slow things stacking up and then becoming slow is like dying from papercuts. Each cut on its own isn’t a big deal, but if you get enough of them it does become a big deal and eventually you die a slow and painful death. The same is true in web development, and any other development field for that matter.

So if papercuts on their own don’t really do damage, how do we avoid dying from them? Every block of code we write will cost us a very tiny bit of speed, that’s inevitable. And we can’t just stop writing code in order to keep everything fast, that’s not an option.

Staying alive

The death by papercut scenario is difficult to avoid, but it’s also quite simple at the same time. If you take a good hard look at the code you write you can probably identify small bits of code that execute but don’t actually do much. Maybe you can optimize those? Or maybe you’re doing two API calls straight after each other: one to fetch a list and a second one to fetch the first item on that list. If this is a very common pattern in your application, consider including the first item in that initial API response. The win won’t be huge, but imagine finding just four more of these situations. You may have just won about 50ms of loading time by optimizing the obvious things in your API.

Recently I came across an instance in my own code where I wanted to do a status check for a number of items. Most of the items would keep the default status and I used an API to check all statuses. The API would check all of the items and it would return the ‘new’ status for each item. It was pointed out to me that the amount of data I sent to the API was more than I strictly needed to identify the items. I was also told that returning items that won’t change is a waste of bandwidth, especially because most items wouldn’t have to change their status.

This might sound like 2 very, very small changes were made:

  • Send a little less data to the API
  • Get a little less data back from the API

As small as these changes may seem, they are actually quite significant. First of all, sending half of an already small request is still a 50% cut in data transfer. Also, returning fewer items isn’t just a bandwidth win. It’s also a couple fewer items to loop over in my javascript and a couple fewer DOM updates. So this case saved me a bunch of papercuts that weren’t really harming the project directly, but they could harm the project when it grows bigger and bigger and more of these small oversights stay inside the application, becoming harder and harder to find.

Call it over-optimizing or call it good programming

One might argue that what I just described is over-optimization, which is considered bad practice because it costs quite some effort and yields only small wins. In this case, because the functionality was still being built, it was easy to spot one of these small optimizations. And the right time to do a small optimization is, I’d say, as soon as you notice the possibility to do it.

I haven’t really thought of code in this mindset often, but I really feel like I should. It keeps me on top of my code and I should be able to write more optimized code right from the start. Especially when it comes to data fetching from an API or from a database, I think a very careful approach might be a good thing. Let’s not send a request to the server for every little thing. And let’s not run to the database for every little thing all the time, that’s what caching is for. The sooner you optimize for these small things, the easier it is and the less trouble it will cause you later on.

Don’t depend on javascript to render your page.

Today Christian Heilmann posted this tweet, demonstrating a rather huge delay between page load and javascript execution. This huge delay made the page show things like {{venue.title}} for an awkward amount of time. Once the page has loaded for the first time you can refresh it and the {{venue.title}} won’t show anymore. What it does still show, however, is a wrong page header (the venue title is missing) and the font isn’t loaded straight away either. So I guess we can all agree that there are several things wrong with this page, and at least one of them is, in my opinion, completely unnecessary.

What’s actually going wrong here?

From what I can tell this page is using Angular.js. I’m not going to rant about Angular and what might be wrong with it, there are plenty of other people out there who do that. What I will say, however, is that the approach this website uses to render rather simple data is wrong. This website serves you a webpage from its server. Then the browser has to read that page, load all assets and parse them. So the page will only start looking better after all the javascript has been loaded and executed. If this sounds terrible to you, that’s because it is.

Not only did Christian post a Youtube video, he also ran the page through Web Page Test. This produced another video that shows the page loading with a timer. It takes six(!) seconds for the correct title to appear in this video. It takes twenty-five seconds for the page to be complete. For a webpage, six seconds is a long time; twenty-five seconds is an eternity. People might not even stay on your page for six seconds, let alone twenty-five.

So why is the page this slow? Angular is pretty fast, or at least computers are fast enough to make what Angular does really fast. So something fishy must be going on here. Let’s use some inspector tools to find out what this page is doing and where it’s spending its time.

Loading the page and assets

When loading the page in Safari with timelines enabled we can see how this page loads its assets and when rendering happens. The first thing I’ll look at is the networking section to get an idea of how everything is being loaded.

[Screenshot: Safari’s network timeline for the page load]

As you might expect, the browser will first load the page itself. The page is 10KB, which isn’t bad. A website like reddit weighs in at about 125KB. Then the styles load; at about 240KB they’re rather large, especially if you consider what the page looks like. Then a 5KB print stylesheet is loaded as well, nothing wrong with that.

And then comes an almost 500KB(!) libraries file. That’s huge. Like, really huge. That file alone takes almost a full second to load over my fast cable connection.

At the 577ms mark we start loading the actual information that should be displayed on the page. There’s a full second of latency before that file starts loading. And once that file is loaded (and parsed) the title can be displayed. This result is a little different from the web page test, probably because I have really fast internet and a fast machine and maybe a bit of luck was involved as well, but almost two full seconds before anything happens is a long time. That’s almost two seconds after the initial page loaded.

What does this mean?

The analysis that I’ve done above isn’t very in-depth, and to make my point it doesn’t have to be. What the above statistics tell us is that it took the webpage almost two seconds to go from loading html to displaying a rendered page. Not everything on the page was loaded properly yet; it just took about two seconds to go from {{venue.title}} to displaying the venue’s actual title. What if I told you that there’s no good reason for this to take two seconds, or even the six seconds in the web page test video? And what if I told you that I could get that title on the page in about zero seconds?

Doing this better

The title of this article says that you shouldn’t rely on javascript to render your page and I believe that better.org.uk is a perfect example of why. This website could be improved an insane amount if they would just render that initial view on their server. I don’t see a single valid reason for the server not to pick up that .json file containing the venue info, parse it and put its contents on the page. Doing this would mean {{venue.title}} isn’t in the html at all; instead it will contain the actual venue title. This goes for all the venue info on the page, actually. Why load that after loading the page and all the javascript, and after waiting for the javascript to be parsed? There are times where you might want to do something like this, but that’s usually after user interaction. You might want to use javascript to load a list of search results while a user is typing to provide real-time searching, for example. But I don’t think javascript should be used to load page critical information, that’s the server’s job.
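
As a rough sketch, assuming a node.js server with express and a venue .json file on disk (the file paths and the template are made up for this example), that could look something like this:

```js
var express = require('express');
var fs = require('fs');

var app = express();

// the template contains the {{venue.title}} placeholder, just like the real page
var template = fs.readFileSync('views/venue.html', 'utf8');

app.get('/venue/:id', function(req, res) {
  // read the same .json file the client-side javascript would otherwise fetch
  fs.readFile('venues/' + req.params.id + '.json', 'utf8', function(err, contents) {
    if (err) {
      return res.status(404).send('Venue not found');
    }

    var venue = JSON.parse(contents);

    // fill the placeholder on the server so the browser never sees {{venue.title}}
    res.send(template.replace('{{venue.title}}', venue.title));
  });
});

app.listen(3000);
```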

Wrapping this up

The example I used is a really slow website; not all websites are this slow and some websites don’t suffer from that flash of {{venue.title}}. But my point still stands for those websites, they are slower than they need to be because they load page critical data after the page is loaded. I want to thank Christian Heilmann for tweeting about this issue and allowing me to use his resources for this post. This issue seems to be a real one and everybody seems to forget about it while they fight over which framework is the best, fastest and most lightweight. Not depending on a framework to render your page is the fastest, best and most lightweight way there is. You can find me on Twitter if you have pointers, corrections or opinions on this. Thanks for taking the time to read this!

Sharing cookies between subdomains

A while ago I wrote about sharing cookies on an old Tumblr blog of mine. I figured it contains some valuable information so I migrated the post here. The reason I wrote it was that I was working on a pitch that would take a user through a series of puzzles. Each puzzle would enable the user to see the next website (subdomain) and I decided I’d save the user’s progress in a cookie.

Normally you’d define a cookie for a domain like www.example.com. When the browser tries to read this cookie, the current domain has to be www.example.com. So sub.example.com wouldn’t work and the cookie won’t be read. There is, however, a way to get around this security limitation and it’s quite easy.

If you only specify .example.com (note the period before example.com) as the host for your cookie, it will be accessible to every subdomain of this host. So both www.example.com and sub.example.com can use this cookie.

An example in code (using the jquery cookie plugin):
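
```js
// the cookie name and value are made up for this example;
// the leading period in the domain makes it readable on every subdomain
$.cookie('puzzle_progress', '3', { domain: '.example.com', path: '/' });

// later, on www.example.com or sub.example.com, read it back
var progress = $.cookie('puzzle_progress');
```

The cookie name and value are just examples of course; anything you store this way will be readable on every subdomain of example.com.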

Well, hope it’s helpful to some of you out there.