Pig on HDInsight Server
November 6, 2012
@Slodge prompted me to look in more detail at running Pig on HDInsight.
In my previous post I described how I took semi-structured METAR reports, ran an M/R program on them to extract the cloudbase and temperature, and then created a Hive table on top. In this post I’ll use some basic Pig to examine the data in the Hive table and extract the 10 reports with the highest cloudbase.
To get started I open a Hadoop command shell and browse to c:\Hadoop\pig-0.9.3-SNAPSHOT\bin.
I then run pig.cmd, which takes me to the grunt> prompt.
To start with, I’ll simply read the contents of the table (as it’s not too big at this point) –
grunt> everything = LOAD 'metarsoutput';
and I get a bunch of results, here’s an extract –
To work with the results more easily I can provide details about the schema –
grunt> everything = LOAD 'metarsoutput' as (icao, datetime, cloudbase: int, temperature: int);
grunt> describe everything;
This still produces the same results as before, but now I can ask for the records to be sorted –
grunt> sorted = order everything by cloudbase desc;
grunt> dump sorted;
which produces nicely ordered results –
I can also limit the number of records I want to get back –
grunt> top = limit sorted 10;
grunt> dump top;
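For reference, the statements above could be collected into a single script, roughly as sketched below. The file name top_cloudbase.pig and the alias top10 are placeholders of my own, and the sketch assumes the same Pig 0.9.3 setup and the same metarsoutput data used in the grunt session.

```pig
-- top_cloudbase.pig (hypothetical file name)
-- Load the data behind the Hive table with an explicit schema.
-- Assumes the fields are tab-delimited, which is the default for LOAD.
everything = LOAD 'metarsoutput' AS (icao, datetime, cloudbase: int, temperature: int);

-- Sort so that the highest cloudbase comes first.
sorted = ORDER everything BY cloudbase DESC;

-- Keep only the first 10 records of the sorted relation.
top10 = LIMIT sorted 10;

-- Print the result to the console.
DUMP top10;
```

Pig normally accepts a script file as an argument (for example pig.cmd -f top_cloudbase.pig), although that invocation is untested on this HDInsight build.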
OK, so all of these are pretty basic examples, but they show the basic operation of Pig on HDInsight.
To find out more about what’s possible with Pig, take a look here.
@Slodge actually asked me about being able to run custom functions (UDFs) for Pig in C#, which is not currently possible. Pig does support streaming, though, and that should provide a handy way ‘in’, which I’ll try to look at next.
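As a rough idea of what the Pig side of that could look like, here is a sketch using Pig's STREAM operator. MetarFilter.exe is a hypothetical placeholder for a console program that reads tab-delimited records on standard input and writes records to standard output. It would need to be available on every node that runs the job, and none of this has been tried on HDInsight yet.

```pig
-- Hypothetical sketch: MetarFilter.exe does not exist; it stands in for any
-- console program that reads records from stdin and writes records to stdout.
everything = LOAD 'metarsoutput' AS (icao, datetime, cloudbase: int, temperature: int);

-- Pipe every record through the external program. The AS clause describes
-- the fields the program is assumed to write back (same layout as the input).
filtered = STREAM everything THROUGH `MetarFilter.exe`
           AS (icao, datetime, cloudbase: int, temperature: int);

DUMP filtered;
```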