From Map/Reduce to Hive (and Power View) using HDInsight

Whilst Map/Reduce is very powerful for processing unstructured data, users (and most applications) still prefer to work with structured data in familiar ways, and this is where Hive support and the HDInsight ODBC provider come in very handy.

One could use Map/Reduce to process un- or semi-structured data into structured data in files, which can then be exposed, through Hive, as tables to external systems.

I wanted to demonstrate an end-to-end scenario, but one simple enough not to cloud the principles, and with my love for all things aviation I thought I’d look at aviation weather reports – METARs.

As input I’ve downloaded a bunch of current reports for London’s Heathrow (EGLL), Gatwick (EGKK) and Stansted (EGSS) airports; these come in as strings that look like –

EGLL 280820Z 21006KT 180V260 9999 SCT033 04/02 Q1017 NOSIG

Given the nature of the beast – the METAR data format does have rules, but a) they are quite flexible, and b) they are not always followed to the letter – Map/Reduce is very useful for extracting the relevant information from this fairly loose input format. Thankfully I had already built (for a Windows 8 app I’m working on) a library that parses METARs, so I could use that in my mapper (oh! the benefits of being able to use .net for M/R jobs!)

As an example I’ve decided to create a report to demonstrate how the cloud base and temperature over a particular airport change over time (in this example there’s one layer of scattered clouds at 3,300 feet, represented by the SCT033 string, and a temperature of 04 in 04/02), but of course this can get as complicated as one wants it to be…

The idea is to use a mapper to convert this semi-structured format to a known format, say –

[ICAO Code] \t [observation date/time] \t [cloudbase in feet] \t [temperature]\r\n

With this more structured format I could create a Hive definition on top of it and consume that from, for example, Excel via the ODBC driver.

Let’s see what it takes –

The first step is the M/R layer – in this case I do not really need a reducer/combiner as I have no aggregation to do; I simply want to convert the source data to a more structured format, and that’s what the mapper is all about.

In .net I’ll create the following Mapper class –

    using Microsoft.Hadoop.MapReduce; // the mapper/reducer base classes live here

    public class METARMap : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            context.Log("Processing " + inputLine);
            // my metar files have two lines each - the first is a date in the format
            // 2012/10/28 12:20, the second starts with the ICAO code; I need to ignore
            // the lines with the date, and this will do for the next 988 years or so
            if (!inputLine.Trim().StartsWith("2"))
            {
                Aviator.METAR.DecodedMETAR metar = Aviator.METAR.DecodedMETAR.decodeMETAR(inputLine);
                context.EmitLine(string.Format("{0}\t{1}\t{2}\t{3}",
                    metar.ICAO,
                    calcObservationDateTime(metar),
                    metar.Cloud.Count > 0 ? metar.Cloud[0].Height.ToString() : null,
                    metar.Temprature));
            }
        }
    }

The METAR decoding logic is irrelevant here really; the important piece is that in the .net SDK, alongside the EmitKeyValue function of the context, you can also find EmitLine, which gives you full control over the structure of the emitted line. In this case I chose to stick to the tab-delimited approach, but added additional values after the ICAO code key. (calcObservationDateTime is a function that returns a date/time value based on the first portion of the METAR – 280820Z means the 28th day of the current month and year, at 08:20 UTC.)
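For reference, a minimal sketch of what calcObservationDateTime might look like – note that the ObservationTime property is hypothetical, standing in for however the library exposes the raw time group:

    // A sketch only - assumes DecodedMETAR exposes the raw time group (e.g. "280820Z")
    // through a hypothetical ObservationTime property.
    private static DateTime calcObservationDateTime(Aviator.METAR.DecodedMETAR metar)
    {
        string timeGroup = metar.ObservationTime;          // e.g. "280820Z" (hypothetical)
        int day = int.Parse(timeGroup.Substring(0, 2));    // 28 -> day of month
        int hour = int.Parse(timeGroup.Substring(2, 2));   // 08 -> hour, UTC
        int minute = int.Parse(timeGroup.Substring(4, 2)); // 20 -> minute
        DateTime now = DateTime.UtcNow;                    // month and year come from 'now'
        return new DateTime(now.Year, now.Month, day, hour, minute, 0, DateTimeKind.Utc);
    }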

The result of the map for the input I’ve provided above is –

EGLL 28/10/2012 09:20 33 02

Now – I did say I did not really need a reducer/combiner, and that is true, but as my input comes in many small files, with just a map the output will also be created as many small files, so I created a simple combiner to bring them together. It doesn’t really do anything – it gets the ICAO code as a key and the output from the map (all the fields, tab delimited) as a single value in the array, so it loops over the array emitting the key and each value separately, but now to a single file –

    // An identity-style reducer: it simply re-emits every mapped line under its
    // ICAO key, consolidating the many small map outputs into a single file.
    public class MetarReducer : Microsoft.Hadoop.MapReduce.ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values, Microsoft.Hadoop.MapReduce.ReducerCombinerContext context)
        {
            foreach (string value in values)
                context.EmitKeyValue(key, value);
        }
    }
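For completeness, here’s roughly how the mapper and reducer get wired into a runnable job with the .net SDK – treat this as a sketch: the input path is a placeholder for wherever the raw METAR files live, while the output folder matches the location the Hive table below points at –

    using Microsoft.Hadoop.MapReduce;

    // Ties METARMap and MetarReducer into a single job definition
    public class MetarJob : HadoopJob<METARMap, MetarReducer>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            var config = new HadoopJobConfiguration();
            config.InputPath = "/user/yossidah/metars";          // placeholder input folder
            config.OutputFolder = "/user/yossidah/metarsoutput"; // where Hive will look
            return config;
        }
    }

    // ...and to submit it:
    // var hadoop = Hadoop.Connect();
    // hadoop.MapReduceJob.ExecuteJob<MetarJob>();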

Either way – a single file for all METARs or many files in a folder – the results are in a consistent format, with only the data I need, so I can now create a Hive external table using the following statement in the Hive interactive console –

create external table metars(icao string, obs_datetime string, cloudbase int, temperature smallint) row format delimited fields terminated by '\t' stored as textfile location '/user/yossidah/metarsoutput'

which in turn allows me to query the table –

select * from metars

and get –

EGKK 28/10/2012 09:20 30 5

EGLL 28/10/2012 11:20 15 8
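From here, the report I described earlier – cloud base and temperature over a particular airport over time – is just a query away, along these lines –

select obs_datetime, cloudbase, temperature from metars where icao = 'EGLL'

(Worth noting that obs_datetime is declared as a string, so for proper chronological ordering you’d want to store it in a sortable format, or convert it in the query.)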

Now I can use the Hive add-in for Excel and read that data –

[image: the METAR data in Excel, via the Hive add-in]

…and if it’s in Excel it can be in any other data-based system, including PowerPivot and Power View; here’s one with a bit more data (3 airfields, 6 METARs each) –

[image: a Power View report over the METAR data]
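And Excel is only one ODBC consumer – any .net client can read the same table through the driver; here’s a minimal sketch, assuming a DSN named "HDInsightHive" has been configured for the Hive ODBC driver –

    using System;
    using System.Data.Odbc;

    class HiveOdbcSample
    {
        static void Main()
        {
            // "HDInsightHive" is a hypothetical DSN pointing at the cluster's Hive ODBC driver
            using (var conn = new OdbcConnection("DSN=HDInsightHive"))
            {
                conn.Open();
                var cmd = new OdbcCommand(
                    "select icao, obs_datetime, cloudbase, temperature from metars", conn);
                using (OdbcDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        Console.WriteLine("{0}\t{1}\t{2}\t{3}",
                            reader[0], reader[1], reader[2], reader[3]);
                }
            }
        }
    }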

And so there you go – from unstructured data to Power View, all on Windows and in .net :)

Can’t get SSAS databases to appear in PerformancePoint Dashboard Designer? Check your ADOMD.net version!

I’ve been working over the last couple of days on creating a SharePoint farm consisting of a SharePoint 2013 server backed by a SQL 2012 database, with a Windows 8 client machine.

Overall the experience has been very good and I’m really liking the new look and feel for SharePoint and the new BI capabilities introduced in conjunction with SQL 2012 SP1.

Trying to create a dashboard using the PerformancePoint Dashboard Designer, however, I could not get the database dropdown to populate from my SSAS instance; it just kept coming back empty –

[image: DashboardDesignerError – the empty database dropdown]

A bit of head scratching eventually brought me to look at the SharePoint server’s event log, where this was being logged –

The data source provider for data sources of type ‘ADOMD.NET’ is not registered. Please contact an administrator.

PerformancePoint Services error code 10115.

I expected this to have been installed on the server already, but clearly it was not, so I quickly made my way to Bing and looked up ADOMD.net, which led me to http://www.microsoft.com/en-us/download/details.aspx?id=23089 to download and install the latest version. Only it didn’t help.

To be sure I was picking up the latest version I restarted the server (restarting the application pools would have been enough, but I went the whole hog); this proved useful because although the server kept logging the same warning as before, shortly after starting it also logged the following error –

Unable to load custom data source provider type: Microsoft.PerformancePoint.Scorecards.DataSourceProviders.AdomdDataSourceProvider, Microsoft.PerformancePoint.Scorecards.DataSourceProviders.Standard, Version=15.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c

System.IO.FileNotFoundException: Could not load file or assembly ‘Microsoft.AnalysisServices.AdomdClient, Version=10.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91’ or one of its dependencies. The system cannot find the file specified.

PerformancePoint Services error code 10107.

This explained better what was going on – I probably had ADOMD.net installed, but I needed a very specific version of it, which turns out to be the SQL 2008 one from April 2009, which can be found here.
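In hindsight, a quick way to tell whether the exact version the service is after can actually be resolved is a tiny console app – the assembly name below is copied verbatim from the error above –

    using System;
    using System.IO;
    using System.Reflection;

    class AdomdVersionCheck
    {
        static void Main()
        {
            // full name taken from the PerformancePoint event log entry
            const string name =
                "Microsoft.AnalysisServices.AdomdClient, Version=10.0.0.0, " +
                "Culture=neutral, PublicKeyToken=89845dcd8080cc91";
            try
            {
                Assembly asm = Assembly.Load(name);
                Console.WriteLine("Resolved: " + asm.Location);
            }
            catch (FileNotFoundException)
            {
                Console.WriteLine("Not found - the SQL 2008 (10.0) ADOMD.net client is missing.");
            }
        }
    }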

After I installed this one (and restarted the application pool) my SSAS databases appeared as expected –

[image: the SSAS databases listed in Dashboard Designer]
