Reflection on HDInsight
October 30, 2012
After a few days of ‘playing’ with HDInsight, both server and service, it was time to think back and take stock.
I had no exposure to Hadoop before I started playing with it on Azure a few months back, and I am still, by all accounts, a complete novice. Having spent the last few days on HDInsight, though, there are a few interesting observations I thought I’d share:
Lowering the barrier of entry
Personally, and this is a very subjective viewpoint, this is perhaps the greatest wow factor of all. Until now, if I wanted to “get into the game” with Hadoop, I had to get comfortable with Java (nothing against it, but I’m not), I had to have Eclipse, and everything had to be just right.
If I wanted to run anything locally, I had to get comfortable with running Hadoop in Cygwin and all sorts of other things (but frankly, test clusters on Azure have been a great experience; more on that shortly).
Now – with HDInsight Server I can install Hadoop on Windows with a click from the web platform installer; I can get the latest of our distribution of Hadoop (currently in preview) on my laptop, in minutes, with zero config.
I can then use Visual Studio and .NET, both of which I’m very familiar with, to do pretty much all the development I need. I no longer need Eclipse and I don’t really need to use Java.
This is bound to significantly lower the barrier of entry to handling big data for a lot of people. Is this the beginning of Hadoop for the masses?
The other thing that became apparent very quickly as I was building various scenarios in my tests is that, now that I’m developing in .NET, I can not only build on all the knowledge and experience I’ve accumulated over the years, I can actually build on a lot of code I already have.
When I processed the METAR data, I already had the parser I developed for my Windows 8 application. I did not need to rewrite anything; it just slotted in.
Speaking to a few architects and developers in the last week or so, these two points resonate very well. So many people want to get into the thick of things but are somewhat intimidated (as I was) or, quite simply, find the cost of implementation too high.
Choice between cloud and on-premise
When I started, I used hadooponazure.com exclusively, because I did not want to run Hadoop on Cygwin and I did not have access to our server deployment. Now everyone, pretty much, has access to both. This choice is quite powerful. I can see people like me running development instances locally to benefit from quick coding iterations, but running production clusters in the cloud, benefitting from the cost effectiveness of scaling in the cloud, not to mention on-demand clusters that can be removed when the processing is done.
I could also easily imagine the reverse – teams using the cloud as a test bed, on test data, before running on a production cluster, avoiding the need to maintain several environments on-premises.
Sticking with the community
Hadoop has a large ecosystem, and it’s great that we can remain part of that. Many projects are being worked on to stabilize and improve them on Windows, with others to come in the future, but it seems that many ‘just work’ even now. It is simply great that the decision was taken not to reinvent the wheel here.
So far I’ve been using HDFS and Map/Reduce, but also Hive quite significantly and a bit of Mahout, and I know others have been trying out Oozie and other projects on it too.
I’ve been browsing the Apache repositories a little bit and it seems to me that the contributions made are very well received and go beyond benefiting just those who choose to run Hadoop on Windows, and that’s great too!
Connecting with the rest of the stack
But of course, technology is here to serve a purpose, and when it comes to data, there are already protocols and interfaces that systems and users use very effectively.
People, fundamentally, don’t want to change how they consume data, big or small. They want to be able to leverage Hadoop but remain in familiar tools and technologies such as SQL Server with Analysis Services, Reporting Services and Power View, and tools such as Excel with Pivot Tables, data mining plug-ins etc.
Of course there are many other benefits; I’m being really selfish here and putting down the four that struck me most for my immediate needs. I suspect many organisations will value the ability to run on Windows with all the management story that comes with it (whilst others, I’m sure, won’t care), and there are some very important capabilities that still need covering: Active Directory integration, for example, for better security, or System Center integration for better monitoring. But these are for another day.