Implementing Pig streaming extension in c#

The next step after playing with basic Pig on HDInsight was to look at what it takes to bring c# to the picture.

Pig allows creating User Defined Functions (UDF) in various languages but unfortunately .net isn’t currently supported.

On the other hand, though, Pig also supports streaming, and we already know that .net plays very nicely in this context, so – how does that work and what can I do with it?

Looking back at my METAR processing sample, let’s imagine that we wanted to use Pig to query the results and the wanted to produce the average temperature per airfield in a given result set. sounds good?

Well – using standard Pig script we can produce the result set we want and then, given a variable RESULT, we could invoke a custom c# program on in using the following syntax (note: pay attention to the direction of the ‘tick’, it caught me!) :

grunt> AVERAGE = stream RESULT  through `c:\\test\\pigStreamSample.exe`;

grunt> dump AVERAGE;

In real life I would probably use DEFINE to alias the definition of the extension and use SHIP to point at the location of the executable  –

grunt> define average `pigStreamSample.exe` ship (‘c:\\test\\pigStreamSample.exe’);

grunt> B = stream A through average;

grunt> dump B;

the output of this would look as follows –

image

My initial result set contained 144 METAR records for 3 airfields, these have been averaged to produce 3 records using a simple c# program, but this program could have been doing anything. nice!

How does pigStreamSample.exe look like? well – it’s not pretty, but it’s good enough to demonstrate the work required –

class Program
    {
        static void Main(string[] args)
        {
            Stream stdin = Console.OpenStandardInput();
            Stream stdout = Console.OpenStandardOutput();
            StreamWriter outputWriter = new StreamWriter(stdout);
            StreamReader inputReader = new StreamReader(stdin);
            
            List<Result> results = new List<Result>();

            string line = inputReader.ReadLine();
            while (line != null)
            {
                //parse the input as received
                METAR metar = parseLine(line);
                //find if ICAO code already 'known'
                Result result = results.FirstOrDefault(r=>r.ICAO==metar.ICAO);
                if (result == null) //if ICAO code not alreayd known create a new one
                {
                    result = new Result(metar.ICAO);
                    results.Add(result);
                }
                result.count++;//increment count
                result.sum+=metar.Temperature;//add sum

                //outputWriter.WriteLine(line);
                line = inputReader.ReadLine();
            }

            foreach (Result r in results)
                outputWriter.WriteLine(string.Format("{0}\t{1}", r.ICAO, r.sum / r.count));//output the average per ICAO

            inputReader.Close();
            inputReader.Dispose();
            outputWriter.Close(); 
            outputWriter.Dispose();
        }

        private static METAR parseLine(string line)
        {
            string[] parts = line.Split('\t');
            METAR metar = new METAR();
            metar.ICAO = parts[0];
            metar.Date = parts[1];
            metar.Cloudbase = int.Parse(parts[2]) * 100;
            metar.Temperature = int.Parse(parts[3]);
            return metar;
        }
        
    }

with METAR being a simple class

    class METAR
    {
        public string ICAO { get; set; }
        public string Date { get; set; }
        public int Cloudbase { get; set; }
        public int Temperature { get; set; }
    }

and Result even simpler –

    class Result
    {
        public Result(string ICAO)
        {
            this.ICAO = ICAO;
            this.count = 0;
            this.sum = 0;
        }
        public string ICAO { get; set; }
        public int count { get; set; }
        public int sum { get; set; }
    }

About Yossi Dahan
I work as a cloud solutions architect in the Azure team at Microsoft UK. I spend my days working with customers helping be successful in the cloud with Microsoft Azure.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: