Pig on HDInsight Server

@Slodge had prompted me to look in more detail into running Pig on HDInsight.

I’ve played with it the past using the interactive javascript console and the word count sample, but haven’t gone into much more detail and so I thought it would be nice to do something at the back of my ‘fancy’ aviation weather report using Pig and Hadoop on Windows.

In that previous post I described how I took semi-structured METAR reports, ran a M/R program on them to extract the cloudbase and temperature and then create a hive table on top; in this post I’ll use some basic Pig to examine the data in the hive table and extract the 10 reports with the highest cloudbase.

To get started I open a hadoop command shell and browse to c:\Hadoop\pig-0.9.3-SNAPSHOT\bin

I then run the pig.cmd which takes me to the grunt> prompt

image

to start with, I’ll simply read the contents of the table (as it’s not too big at this point) –

grunt>everything = LOAD ‘metarsoutput’;

grunt>dump everything;

and I get a bunch of results, here’s an extract –

image

To work with the results better I could provide details about the schema –

grunt>everything = LOAD ‘metarsoutput’ as (icao, datetime, cloudbase: int, temperature: int);

grunt> describe everything;

produces –

image

whilst

dump everything;

still produces the same results as before, but now I can ask the records to be sorted –

grunt> sorted = order everything by cloudbase desc;

grunt> dump sorted;

which produces a nicely ordered results –

image

I can also limit the number of records I want to get back –

grunt> top = limit sorted 10;

grunt> dump top;

image

 

Ok –so all of these are pretty basic examples, but as such show the basic operation of Pig on HDInsight.

To find out some more about what’s possible with PIG take a look here

@Slodge actually asked me about being able to run custom functions (UDFs) for Pig in c#, which is not currently possible, but Pig does support streaming, and that should provide a handy way ‘in’, which I’ll try to look at next.

About Yossi Dahan
I work as a cloud solutions architect in the Azure team at Microsoft UK. I spend my days working with customers helping be successful in the cloud with Microsoft Azure.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: