Introduction to Web Scraping for Marketers

Over the span of just a few years the digital marketing landscape has evolved into a data-crazed world filled with programmatic marketing products that are aimed at mining data and making bid adjustments based on real time cost-benefit analysis. With the advent of this type of software, businesses have flocked to purchase these solutions in order to reap the exponential lifts that the platforms promise. In some instances these decisions turn out to be a slam dunk, while in other instances the results have zero impact on marketing performance and the marketing teams are forced to develop an in-house solution.

Regardless of the case that occurs, there is an inherent need for marketers to be trained on various programming languages like JavaScript and Python in order to implement 3rd party platform pixels or create a homegrown solution. This type of skill set is not typically found with the present day marketer, but will definitely be found in the skill set of the next generation marketer.

As a result, it is good to read up on basic programming languages and know enough to help reduce the need for development team resources for common marketing team requests. I am a major advocate of marketers following this path and will try to introduce such topics in blog posts like this one on web scraping.

Web Scraping

Web scraping for the purpose marketing can occur varies case to case, but almost always has an element of collecting information or cleaning data. From keeping tabs on competitor rankings or product pricing to cleaning up and categorizing internal data, the possibilities with web scraping are endless.

In the accompanying guide I will walk you through a quick tutorial on how your can build a web scraper with NodeJS.

Prerequisites:

-You must have Nodejs and NPM installed on your machine -Node.js & npm (npm is bundled with node) - Install here: https://nodejs.org/en/

-Basic understanding of JavaScript and Node.js

-Text editor program - Doesn’t matter which you use. Two popular programs are Sublime Text & Atom.io

-Command Line Interface

Note: This tutorial was created with a Mac. The only difference that the operating system makes for this tutorial is the command line interface or CLI that is used. If you are using a Mac, your default CLI will be terminal. If you are using Windows, your default CLI will be command prompt.

Let’s get to it!

Step 1) Create a file named package.json. This file contains all of the modules that are required for our web scraper to work as well and the javascript file that is the primary entry point to the program. Set your file up like so.

package.json:

{
  "name": "marketing-web-scraper",
  "version": "1.0.0",
  "description": "Web scraper for marketers",
  "main": "app.js",
  "author": "Your Name",
  "dependencies": {
    "cheerio": "^0.19.0",
    "express": "^4.13.3",
    "fs": "0.0.2",
    "request": "^2.67.0"
  }
}

After you save the file, go to your CLI and change directories to your app folder like so

cd /path/to/application-folder

and run

npm install

This will run a series of commands to install the packages from npm. If your install was successful, you will notice that a node_modules folder was created with folders created for each of the dependencies listed. These folders contain code, which will interact with our program when we require the modules within our app.

Step 2) Create a javascript file titled “app.js”, require each of the modules located in the dependencies section of the package.json and then set our URL that we want scraped with a port for the server to be accessed from. In my example I decided to scrape the headlines from one of my favorite web development sites, asiteapart.com (var url = ‘http://alistapart.com/’) Look for the commented section to the right of the code for an explanation of what the modules provide.

Please note that all URL’s have to contain http:// or https://. If an application protocol type is not included in the URL then you will receive the following error message [Error: Invalid URI "alistapart.com"]


var express = require("express"); // NodeJS framework
var fs = require("fs"); // Simple server request
var request = require("request"); // Access to the file system
var cheerio = require("cheerio"); // Server-side jQuery
var app = express();

var url = 'http://alistapart.com/' //Website for the web scrape

var port = '8080'; // Server port for the listen function

Step 3) Set up an empty array that will be used to store the scraped data and create a route with a GET request that contains a require function. When we access this route at localhost:8080/ the web scraping function will be triggered

var titles = []; //Empty array for storing scraped data

//Web Scraping Logic accessed at localhost:8080/
app.get('/', function(req, res){
)};

Step 4) Add the request function which will access our url. The if(err) clause is used to send us an error message if the request fails.

//1) Request a URL, Then check to see if there is an error accessing the website. If there is no error accessing the website, then log the HTML body of the page
  request(url, function(err, res, body){
    if(err){
      console.log(err);
    } else {
     //Step 5 code
    }

Step 5) Load the body of the page with the cheerio “load” method and set it to the ‘$’. Then traverse the body content for the specific HTML elements that you are looking to scrape. For my example I am looking to scrape the headlines of the articles listed. When looking at the source code of my URL I notice a common naming convention trend of ‘h4.summary-title’. This looks like the exact information I was looking for so I pass this to the cheerio variable and set a loop to capture all of the text on that page that fit this criteria, which will then be stored in the empty variable we set earlier in the code.

//Provide Cheerio with access to the URL's HTML
var $ = cheerio.load(body);

//Traverse the HTML for the h4 tags used for the article titles

$('h4.summary-title').each(function(i, elem){
  titles[i] = $(this).text();
});


console.log(titles);

Step 6) Write the scraped information to a .csv file that will be found within our application folder and log success text to our terminal and to our localhost page.

fs.writeFile('web-scrape-results.csv', JSON.stringify(titles, null, " "), function(err){
        if(err){
          console.log(err);
        } else {
          console.log("Data written successfully!");
        }
      });

Step 7) Go back to your CLI, run

node app.js
, and then go to
localhost:8080/
If you followed all of the steps correctly then you should have a newly created csv file called ‘web-scrape-results.csv’ with the scraped information. If you don’t see that file then you missed some parts of code along the way and should refollow the steps or tweet at me/send me an email and I will be more than happy to help!

Full Code:

***package.json***

{
  "name": "marketing-web-scraper",
  "version": "1.0.0",
  "description": "Web scraper for marketers",
  "main": "app.js",
  "author": "Connor Phillips",
  "dependencies": {
    "cheerio": "^0.19.0",
    "express": "^4.13.3",
    "fs": "0.0.2",
    "request": "^2.67.0",
    "x-ray": "^2.0.3"
  }
}
app.js
//Step 2
var express = require("express"); // NodeJS framework
var fs = require("fs"); // Simple server request
var request = require("request"); // Access to the file system
var cheerio = require("cheerio"); // Server-side jQuery
var app = express();

var url = 'http://alistapart.com/' //Website for the web scrape

var port = '8080'; // Server port for the listen function

//Step 3
var titles = []; //Empty array for storing scraped data


//Web Scraping Logic accessed at localhost:8080/
app.get('/', function(req, res){

//Step 4 

  //Request a URL, Then check to see if there is an error accessing the website. If there is no error accessing the website, then log the HTML body of the page
  request(url, function(err, res, body){
    if(err){
      console.log(err);
    } else {

//Step 5

      //Provide Cheerio with access to the URL's HTML
      var $ = cheerio.load(body);

      //Traverse the HTML for the h4 tags used for the article titles
      $('h4.summary-title').each(function(i, elem){
        titles[i] = $(this).text();
      });

      console.log(titles);

//Step 6

      fs.writeFile('web-scrape-results.csv', JSON.stringify(titles, null, " "), function(err){
        if(err){
          console.log(err);
        } else {
          console.log("Data written successfully!");
        }
      });

    }
  })
  res.send(url + " was successfully scraped!");


});

app.listen(port);

console.log('Listening on port' + port);