Building an Inverted Index with Hadoop MapReduce (for a search engine)
Overview
In this article I am going to demonstrate how to build an inverted index from crawled web page data for fast web look-ups. This type of index is what search engines use to quickly resolve the terms you type into a search box, and it is also used in some databases for faster look-ups.
To give a brief description, an inverted index is an index where every word in the database has an associated list of web pages: you type "news" and it gives you bbc.co.uk, guardian.co.uk, etc. This provides an efficient way of looking up the web for related data. The downside, obviously, is the need to build and maintain a huge database of data (the index) for this quick look-up to work. Also, since it is a vast amount of data, we need to store it in a distributed computing environment and use sophisticated programs to pinpoint the correct machine in the network to retrieve this information. This is, in fact, the main reason Hadoop and MapReduce were built in the first place.
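To make the idea concrete, here is a toy illustration in Java (purely illustrative, not part of the Hadoop job): every word maps to the set of pages that mention it, so a search becomes a single map lookup.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TinyInvertedIndex {
    public static void main(String[] args) {
        // Every word points to the set of pages that mention it
        Map<String, Set<String>> index = new HashMap<>();
        index.computeIfAbsent("news", k -> new HashSet<>()).add("bbc.co.uk");
        index.computeIfAbsent("news", k -> new HashSet<>()).add("guardian.co.uk");
        index.computeIfAbsent("weather", k -> new HashSet<>()).add("bbc.co.uk");

        // Looking up "news" is a single map access instead of scanning every page
        System.out.println(index.get("news")); // e.g. [guardian.co.uk, bbc.co.uk]
    }
}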
Objective
In this article I am going to go through an example I used to build an inverted index when I was working on the Jumponio.com project (now obsolete), where the index was eventually to be stored in Amazon DynamoDB.
Preparation
For this example I have already crawled the web, acquired all the metadata needed and saved it in JSON format using the method I explained in my previous post. My development and testing environment is as follows:
- Java - for MapReduce code
- Maven - dependency management and package builder
- Eclipse - IDE
- Docker - Hadoop cluster running on Docker containers
- Ubuntu - of course…. ;)
Although you can run Hadoop on your local machine in pseudo-distributed mode, you can also run it in fully distributed mode using the method I explained in my previous posts on how to set up a Hadoop cluster on Docker containers. I am going to use distributed mode and run it on my own cluster.
Data
As I mentioned before, I have already crawled the data and have it on the local host as per below:
khalid@ubuntu:/JumpOnCrawler/out/original$ ls -l | head
total 504460
-rw-rw-r-- 1 khalid khalid 11233 Oct 22 2016 0044.co.uk.json
-rw-rw-r-- 1 khalid khalid 11405 Oct 23 2016 007escorts.co.uk.json
-rw-rw-r-- 1 khalid khalid 20978 Oct 22 2016 01100111011001010110010101101011.co.uk.json
-rw-rw-r-- 1 khalid khalid 0 Oct 22 2016 01casinos.co.uk.json
-rw-rw-r-- 1 khalid khalid 16134 Oct 22 2016 020.co.uk.json
-rw-rw-r-- 1 khalid khalid 18288 Oct 23 2016 08direct.co.uk.json
-rw-rw-r-- 1 khalid khalid 0 Oct 23 2016 10001games.co.uk.json
-rw-rw-r-- 1 khalid khalid 0 Oct 23 2016 1000-vouchers.co.uk.json
-rw-rw-r-- 1 khalid khalid 75 Oct 22 2016 1001hosting.co.uk.json
A total of 504460 crawled files in JSON format.
Just to give you an idea, below is what the first file looks like:
khalid@ubuntu:/media/khalid/home/feo/crawler/JumpOnCrawler/out/original$ head 0044.co.uk.json
[
{"url": "/", "base": "http://www.0044.co.uk/", "description": "International SIM cards & Data SIM cards help you to avoid all roaming charges.", "title": "International SIM cards & Cheap International Calls"},
{"url": "/about-us.htm#terms", "text": "Terms & Conditions", "base": "http://www.0044.co.uk/", "description": "0044 Ltd have been specialists for over 10 years in the sourcing the best global SIM Cards for cheap international calls and data roaming.", "title": "About 0044 Ltd Global SIM Cards and Cheap International Calls"},
{"url": "http://www.0044.co.uk/contact-us.htm", "text": "contact us", "base": "http://www.0044.co.uk/", "description": "We can help and advise about all aspects of global sim cards and cheap international calls", "title": "Global SIM Card advice - Contact 0044"},
....
Now, we need to upload it to the Hadoop cluster:
hdfs dfs -put original /users/khalid/
Application
Now, the application part. I have already developed the code and uploaded it to my repository, which you can fork:
git clone https://github.com/khalidmammadov/hadoop_reverse_index_mapreduce.git
Here, I split the app into three parts: the Map, Reduce and Main classes. Let's have a look at each one.
The Mapper gets every line from the file as text and converts it to a JSON object. Then, after some checks, the values of the "keywords", "description" and "title" keys are passed to the word processor, which extracts every word, cleans it and de-duplicates using the Java Set class. The keys from the set are then submitted to the context.
De-duplication and cleaning
ParserMap.java
// Split the value of the given JSON key on whitespace and commas, trim each
// word, strip non-word characters and add it to the set, which de-duplicates
String[] words = obj.has(key) ? obj.getString(key).split("\\s|,") : null;
if (words != null) {
    for (String val : words) {
        if (val.trim().length() > 0) {
            uniqueList.add(val.trim().replaceAll("[\\W+|\\\"]", ""));
        }
    }
}
Mapper
// Drop the trailing comma left over from the JSON array formatting
int lastIdx = line.length() - (line.endsWith(",") ? 1 : 0);

JSONObject obj = null;
try {
    obj = new JSONObject(line.substring(0, lastIdx));
} catch (JSONException e) {
    // Skip lines that are not valid JSON objects (e.g. the array brackets)
    return;
}

if (obj != null) {
    Set<String> strSet = new HashSet<String>();

    // Ignore records without keywords or a base URL
    if (!obj.has("keywords") || !obj.has("base")) {
        return;
    }

    // Collect cleaned, de-duplicated words from these three fields
    String[] keys = {"keywords", "description", "title"};
    for (String s : keys) {
        setWord(obj, s, strSet);
    }

    // Emit a (word, base URL) pair for every unique word
    if (!strSet.isEmpty()) {
        for (String w : strSet) {
            word.set(w);
            context.write(word, new Text(obj.getString("base")));
        }
    }
}
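For completeness, here is roughly how these two fragments sit inside the Mapper class. Treat it as a sketch only: the generic types, the imports, the word field and the package are my assumptions based on the standard Hadoop Mapper API and the run command further down; the exact class is in the repository.
package co.uk.khalidmammadov.hadoop; // package assumed to match the Main class used in the run command

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONException;
import org.json.JSONObject;

public class ParserMap extends Mapper<Object, Text, Text, Text> {

    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // ... mapper body as shown above ...
    }

    private void setWord(JSONObject obj, String key, Set<String> uniqueList) {
        // ... de-duplication and cleaning loop as shown above ...
    }
}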
In the reduce phase, the app de-duplicates the values (i.e. the web addresses), joins the locations where each word was mentioned with a pipe (|), and submits them to the context.
ReverseIndexReducer.java
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    String locations = "";
    Set<String> strSet = new HashSet<String>();

    //Remove duplicates
    for (Text val : values) {
        strSet.add(val.toString());
    }

    //Join URLs
    for (String str : strSet) {
        locations = String.join("|", locations, str);
    }

    result.set(locations);
    context.write(key, result);
}
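The result variable is not declared in the snippet; presumably it is a Text field on the reducer class, along these lines (again only a sketch, with the imports and generic types assumed from the standard Hadoop Reducer API):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReverseIndexReducer extends Reducer<Text, Text, Text, Text> {

    private final Text result = new Text();

    // ... reduce() as shown above ...
}
Note that because locations starts out as an empty string, the first String.join adds a leading pipe, which is why every URL list in the output further down begins with |.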
Finally, the runner app, where all the pieces come together. Here we set the mapper and reducer classes. Also, we indicate that the input and output directories are supplied externally. You can see in the code that I have left some commented-out lines; they are there for local testing purposes (hence the input and output directories you can see in the git repository). Additionally, I have added a line of code to delete the target folder before starting so that reruns do not fail. Here is the code:
Main.java
public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "reverse index builder");
    job.setJarByClass(Main.class);
    job.setMapperClass(ParserMap.class);
    job.setReducerClass(ReverseIndexReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    Path inputFilePath = new Path(args[0]);
    Path outputFilePath = new Path(args[1]);
    FileInputFormat.addInputPath(job, inputFilePath);
    FileOutputFormat.setOutputPath(job, outputFilePath);

    //For local run
    //Path inputFilePath = new Path("/home/khalid/workspace/hadoop_reverse_index_mapreduce/input");
    //Path outputFilePath = new Path("/home/khalid/workspace/hadoop_reverse_index_mapreduce/output/1");
    //FileInputFormat.addInputPath(job, inputFilePath);
    //FileOutputFormat.setOutputPath(job, outputFilePath);

    // Delete the output folder if it already exists so reruns do not fail
    FileSystem fs = FileSystem.get(new URI(outputFilePath.toString()), getConf());
    fs.delete(outputFilePath, true);

    return job.waitForCompletion(true) ? 0 : 1;
}
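One thing the snippet does not show is the entry point. The run(String[]) signature and the getConf() call suggest that Main extends Configured and implements Tool, in which case a typical main() would look roughly like this (using org.apache.hadoop.conf.Configuration and org.apache.hadoop.util.ToolRunner; the actual entry point is in the repository):
// Hands the command line arguments (the input and output paths) to run()
// via ToolRunner, which also parses the generic Hadoop options.
public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new Main(), args);
    System.exit(exitCode);
}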
Running the code
Before we can run the code we need to package it up. As mentioned, I use Maven and it's great!
mvn clean
mvn package
Ok, now we can run it.
HADOOP_CLIENT_OPTS="-Xmx4g" hadoop jar target/hadoop_reverse_index_mapreduce-0.0.1-SNAPSHOT.jar co.uk.khalidmammadov.hadoop.Main \
hdfs://192.168.1.5:9000/users/khalid/input \
hdfs://192.168.1.5:9000/users/khalid/out
You might wonder what the heck this is: HADOOP_CLIENT_OPTS="-Xmx4g". It is there to stop my job from failing due to insufficient memory on the client, so I am increasing it to 4 GB.
After running through 504460 files (~5 min in my case) it finished successfully and produced the output files:
khalid@ubuntu:$ hdfs dfs -ls hdfs://192.168.1.5:9000/users/khalid/out
Found 2 items
-rw-r--r-- 3 khalid khalid 0 2017-12-25 22:22 hdfs://192.168.1.5:9000/users/khalid/out/_SUCCESS
-rw-r--r-- 3 khalid khalid 97350207 2017-12-25 22:22 hdfs://192.168.1.5:9000/users/khalid/out/part-r-00000
Let's check out what is inside part-r-00000. It has got a lot of long data, so I will only put a few snippets here to show the point:
giftcard |http://www.hotdeals.co.uk/|http://www.rspb.org.uk/|http://www.currys.co.uk/gbuk/index.html
giftcards |https://www.amazon.co.uk/|http://www.thegiftcardcentre.co.uk/
giftcardtcs |http://blissworld.co.uk
giftcertificates |https://www.spreadshirt.co.uk/
giftsCrabtree |http://www.crabtree-evelyn.co.uk/home.html
giftsets |https://www.solippy.co.uk/|https://www.eastendcosmetics.co.uk/
gifts |http://www.flyerzone.co.uk/|http://deeplinkdirectory.co.uk|https://www.spreadshirt.co.uk/|http://wish.co.uk|http://www.adventuretravelmagazine.co.uk/|http://www.printster.co.uk/|http://www.expressgiftservice.co.uk/|http://www.designitnow.co.uk/|http://www.emmabridgewater.co.uk/
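To show how this output can actually serve look-ups, here is a hypothetical loader that reads a part-r-00000 file back into an in-memory map and queries a word. It assumes the default tab separator between key and value and the pipe-separated URL list (with its leading pipe) seen above; the class name and the local file name are illustrative only.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexLookup {
    public static void main(String[] args) throws Exception {
        Map<String, List<String>> index = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader("part-r-00000"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2); // key <TAB> joined URLs
                if (parts.length < 2) {
                    continue;
                }
                List<String> urls = new ArrayList<>();
                for (String url : parts[1].split("\\|")) { // values are pipe separated
                    if (!url.trim().isEmpty()) {
                        urls.add(url.trim());
                    }
                }
                index.put(parts[0].trim(), urls);
            }
        }
        // e.g. prints all the sites indexed under "gifts"
        System.out.println(index.getOrDefault("gifts", Collections.emptyList()));
    }
}
In a real setup the index would of course live in a database such as DynamoDB rather than in memory, as mentioned in the objective above.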
Summary
As you can see, we have got a list of URLs for every word we are after, and hopefully :) we can easily find related information with our own search engine! I say hopefully because I have not mentioned anything about search engine algorithms such as Markov chains or PageRank, which are essential for finding the required data, but I think that is a completely different story! Please do not hesitate to leave comments below, and thanks for reading!