From Zero to Map Reduce in .net on Windows in 10 minutes

Back in May I posted about what it took to set up a development environment for building map/reduce programs for Hadoop on Azure (HoA), now officially the Azure HDInsight Service.

Yesterday Microsoft published the Microsoft .NET SDK for Hadoop, which makes it easier to build map/reduce jobs in .NET. To me (and many other developers, I suspect) this makes it all the more approachable, which is awesome!

To begin with, it means being able to use Visual Studio and not having to have (the correct version of), which to me is clearly a great advantage. It gets better, though: the SDK, like many others these days, is available through NuGet, so to develop my map/reduce program in VS I simply open it and type the following in the Package Manager Console –

install-package Microsoft.Hadoop.MapReduce

Doing so added a few things to my project: I can now see references to Microsoft.Hadoop.MapReduce and Newtonsoft.Json, as well as an additional ‘MRLib’ folder containing several useful resources.

The added namespace provides, amongst other things, a MapperBase class that I can use to define my mapper, with a Map function for me to override –

public class WordCountMapper : Microsoft.Hadoop.MapReduce.MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        throw new NotImplementedException();
    }
}

And so, if I were to implement Hadoop’s equivalent of Hello World, the infamous WordCount, I would do something along the lines of –

public override void Map(string inputLine, MapperContext context)
{
    string[] words = inputLine.Split(' ');
    foreach (string word in words)
        context.EmitKeyValue(word, "1");
}
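One thing to watch for with the simple Split(' ') above is that consecutive spaces produce empty strings and punctuation stays attached to the words, so "the" and "the," end up as different keys. A minimal sketch of a more forgiving tokenizer follows. It is plain C# with no dependency on the SDK, and the WordTokenizer class and its Tokenize method are names I made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the Hadoop SDK.
public static class WordTokenizer
{
    // Returns the runs of letters and digits found in the line, lower-cased.
    // Whitespace and punctuation act as separators, so no empty tokens are produced.
    public static IEnumerable<string> Tokenize(string inputLine)
    {
        if (string.IsNullOrEmpty(inputLine))
            return Enumerable.Empty<string>();

        return Regex.Matches(inputLine, @"[\p{L}\p{N}]+")
            .Cast<Match>()
            .Select(match => match.Value.ToLowerInvariant());
    }
}
```

The mapper's loop could then iterate over WordTokenizer.Tokenize(inputLine) instead of the result of Split. Note that this splits on apostrophes too, so "don't" becomes "don" and "t"; whether that matters depends on what you want to count.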

The reducer would then look something like –

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class WordCountReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Each value is always "1", so counting the values is as good as summing them.
        context.EmitKeyValue(key, values.Count().ToString());
    }
}
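Counting works here because the mapper only ever emits "1". If the same class were also used as a combiner (which the ReducerCombinerBase name suggests is possible), the reducer would receive partial counts rather than ones, and summing would be the safer choice. The sketch below mimics the map, group-by-key and reduce steps in memory to show the difference. It is plain C# with no dependency on the SDK, LocalWordCount and its methods are names I made up for illustration, and it is not how Hadoop actually executes the job:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for the map/shuffle/reduce flow.
public static class LocalWordCount
{
    // Map step: emit a (word, "1") pair for every word on the line.
    public static IEnumerable<KeyValuePair<string, string>> Map(string inputLine)
    {
        return inputLine
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(word => new KeyValuePair<string, string>(word, "1"));
    }

    // Reduce step: sum the values rather than count them, so the result
    // stays correct when the values are partial counts from a combiner.
    public static int Reduce(IEnumerable<string> values)
    {
        return values.Sum(value => int.Parse(value));
    }

    // Shuffle step: group the mapped pairs by key, then reduce each group.
    public static Dictionary<string, int> Run(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => Map(line))
            .GroupBy(pair => pair.Key, pair => pair.Value)
            .ToDictionary(group => group.Key, group => Reduce(group));
    }
}
```

For example, if a combiner had already turned three occurrences of a word into the values "2" and "1", Reduce returns 3, whereas counting the values would give 2.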

To run this I can create a job configuration –

public class WordCountJob : HadoopJob<WordCountMapper, WordCountReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        HadoopJobConfiguration config = new HadoopJobConfiguration();
        config.InputPath = context.Arguments[0];
        config.OutputFolder = context.Arguments[1];
        return config;
    }
}

At this point I was going to try and run this on Windows Azure, but then I figured, what the heck, I’ll run this on my laptop. After all, HDInsight Server is available through the Web Platform Installer.

After installing it I can see the services running, as well as a couple of web sites.

So – with Hadoop now installed, I need to get some files into it –

I’ll start by uploading a book from Project Gutenberg to HDFS. To do that I open the Hadoop command shell (the shortcut on the desktop points to C:\windows\system32\cmd.exe /k pushd "c:\hadoop\hadoop-1.1.0-SNAPSHOT" && "c:\hadoop\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd")

and then use the following command to create a books folder –

hadoop fs -mkdir books

and the following command to upload a file to that folder –

hadoop fs -copyFromLocal "<path>\Ulysses.txt" books/ulysses.txt

I could verify that the file had been uploaded to HDFS using –

hadoop fs -ls books

I then ran the job using MRRunner, providing the two parameters the job configuration expects, the input path and the output folder –

mrrunner -dll WordCountSample.dll -- books output

With that done, I could verify my results using

hadoop fs -cat output/part-00000
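Scrolling through the whole file is not the easiest way to see which words dominate. A minimal sketch for picking the most frequent words out of the output follows, assuming each output line has the form word, a tab, then the count (I have not checked the exact format the job writes, so treat this as an assumption). WordCountOutput and TopWords are names I made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for inspecting word count output locally.
public static class WordCountOutput
{
    // Parses "word<TAB>count" lines and returns the most frequent words first.
    // Lines that do not match the assumed format are skipped.
    public static List<KeyValuePair<string, int>> TopWords(IEnumerable<string> lines, int howMany)
    {
        var parsed = new List<KeyValuePair<string, int>>();
        foreach (string line in lines)
        {
            string[] parts = line.Split('\t');
            int count;
            if (parts.Length == 2 && int.TryParse(parts[1], out count))
                parsed.Add(new KeyValuePair<string, int>(parts[0], count));
        }

        // Highest count first; ties are broken alphabetically so the order is stable.
        return parsed
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key, StringComparer.Ordinal)
            .Take(howMany)
            .ToList();
    }
}
```

The lines could come from File.ReadLines on a local copy of the part file.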

Admittedly this is only scratching the surface, and it is a very basic sample; a slightly more elaborate one is to follow. My intention here was really just to show how easy it is to get started with Hadoop and .NET on Windows, and I hope this point was made.

Note – for some reason, when I tried this in the office, Hadoop insisted on using an unfamiliar host name when accessing log files and the results of the map phase, which made the job get ‘stuck’ at 100% map and 0% reduce. I could work around it initially by adding a hosts file entry pointing this unknown host name to 127.0.0.1, but this morning I learnt from a colleague that Hadoop does a reverse DNS lookup to find the name node and job tracker, and it just happened to find a more responsive machine on the network than mine! Rebooting my machine last night at home prevented this from happening and everything worked very smoothly.

About Yossi Dahan
I work as a cloud solutions architect in the Azure team at Microsoft UK. I spend my days working with customers, helping them be successful in the cloud with Microsoft Azure.

3 Responses to From Zero to Map Reduce in .net on Windows in 10 minutes

  1. Robin says:

    Hi, I get this error upon executing mrrunner.exe

    ERROR: More than one class in DLL derives from MapReduceJob. Specify the intended class explicitly.

    Can you help?

    • Yossi Dahan says:

      I think so.
      I’m guessing you’re using .NET and MRRunner.exe. This will reflect on your assembly and try to find the MapReduceJob implementation. If there’s only one, everything’s fine; if there are multiple ones, you are going to have to let it know which one to use. You can use the following format
      mrrunner -dll <assembly>.dll -class <class name> -- <arguments>

      I hope this makes sense

      Yossi

  2. duygu says:

    Hi,I am unable to add the project reference reducercombinerbase.How to add?
    and I will make the graduation project.I do with hadoo on azure.but I don’t know where to start.I use yandex map api in my project.About this subject,do you offer a source of?
