Enhancing the word count sample a little bit

Since starting to play with Hadoop on Windows/Azure (now HDInsight) I’ve wanted to improve the word count sample slightly so that it ignores punctuation and very common words, but as it involved Eclipse and Java it never quite made it to the top of the list. Now that it’s Visual Studio and .NET I really had no excuse, so here are the two changes I’ve made to what I started with in my previous post.

Firstly – to remove all punctuation, I’ve added the following function –

private string removePuncuation(string word)
{
    var sb = new StringBuilder();
    foreach (char c in word.Trim())
    {
        if (!char.IsPunctuation(c))
            sb.Append(c);
    }

    return sb.ToString();
}
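As an aside, the same filtering can be written as a LINQ one-liner; this is just an alternative sketch (the names here are mine, not part of the sample):

```csharp
using System;
using System.Linq;

class PunctuationDemo
{
    // Alternative to the StringBuilder loop above: filter out
    // punctuation characters with LINQ and rebuild the string.
    public static string RemovePunctuationLinq(string word)
    {
        return new string(word.Trim().Where(c => !char.IsPunctuation(c)).ToArray());
    }

    static void Main()
    {
        Console.WriteLine(RemovePunctuationLinq(" hello, world! "));
    }
}
```

Either version behaves the same; the loop is arguably clearer for a sample, the LINQ form is shorter.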

I then added it to my map function as you can see below –

public override void Map(string inputLine, Microsoft.Hadoop.MapReduce.MapperContext context)
{
    string[] words = inputLine.Split(' ');

    foreach (string word in words)
    {
        string newWord = removePuncuation(word);
        context.EmitKeyValue(newWord, "1");
    }
}
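One thing worth knowing about `Split(' ')`: consecutive spaces produce empty strings in the result, which would get emitted as empty keys. A small sketch of the difference, using `StringSplitOptions.RemoveEmptyEntries` as a possible improvement:

```csharp
using System;

class SplitDemo
{
    static void Main()
    {
        string line = "the  quick brown";

        // Plain Split(' ') keeps the empty string between the two spaces,
        // so this prints 4 elements: "the", "", "quick", "brown".
        Console.WriteLine(line.Split(' ').Length);

        // RemoveEmptyEntries drops the empty string, leaving 3 words,
        // which avoids emitting empty keys from the mapper.
        Console.WriteLine(line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length);
    }
}
```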

simples.

To support ignoring common words I wanted to keep the list of words outside the code, as an HDFS file, so firstly I added an Initialize method override to load that list –

private List<string> ignoreList = new List<string>();

public override void Initialize(MapperContext context)
{
    const string IGNORE_LIST_FILENAME = "/user/yossidah/input/ignoreList.txt";
    base.Initialize(context);
    context.Log("WordCountMapper Initialize called");
    context.Log("looking for file " + IGNORE_LIST_FILENAME);
    if (HdfsFile.Exists(IGNORE_LIST_FILENAME))
    {
        context.Log("ignore list file found");
        string[] lines = HdfsFile.ReadAllLines(IGNORE_LIST_FILENAME);
        foreach (string line in lines)
        {
            context.Log("ignore list line: " + line);
            string[] words = line.Split(' ');
            ignoreList.AddRange(words);
            foreach (string word in words)
            {
                context.Log(string.Format("Adding {0} to ignore list", word));
            }
        }
    }
    else
    {
        context.Log("ignore list file not found");
    }
}

(I’ve added a bunch of logging that I can track in the job log file)

I then added a check in the Map function to consult the ignore list and determine whether to call emit or not; here’s the complete Map function again –

public override void Map(string inputLine, Microsoft.Hadoop.MapReduce.MapperContext context)
{
    string[] words = inputLine.Split(' ');

    foreach (string word in words)
    {
        string newWord = removePuncuation(word);
        if (!ignoreList.Contains(newWord))
            context.EmitKeyValue(newWord, "1");
    }
}
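A small note on that `ignoreList.Contains` call: `List<string>.Contains` is a linear scan for every word in the input, so for a longer ignore list a `HashSet<string>` would give constant-time lookups, and can also make the comparison case-insensitive. A hedged sketch (the names and sample words here are mine):

```csharp
using System;
using System.Collections.Generic;

class IgnoreListDemo
{
    // Case-insensitive ignore list; HashSet.Contains is O(1) versus
    // the O(n) scan of List<string>.Contains.
    public static readonly HashSet<string> IgnoreSet =
        new HashSet<string>(new[] { "the", "a", "and" }, StringComparer.OrdinalIgnoreCase);

    static void Main()
    {
        Console.WriteLine(IgnoreSet.Contains("The"));   // matched despite the capital T
        Console.WriteLine(IgnoreSet.Contains("quick")); // not in the list
    }
}
```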

Simples. Not the most elaborate program in the world, but slightly better than my starting point.

I’ve got another, potentially more interesting, program in mind that I could use for demos, but I need to grab some (big) data first – watch this space!

And again – a note: initially I ran all this from my domain user, and I had issues accessing the ignoreList file. I’ve reported this and it’s being looked at, but basically Hadoop currently has a problem validating domain users’ permissions.

There were two ways around it – I uploaded the file from the web interactive console (using fs.put()) and then changed the path in my Initialize method (in my case to /user/hadoop/input/ignoreList.txt); alternatively, I’m pretty sure that if I had done everything from a non-domain-joined account I would not have faced this problem.

About Yossi Dahan
I work as a cloud solutions architect in the Azure team at Microsoft UK. I spend my days working with customers, helping them be successful in the cloud with Microsoft Azure.
