Loading data onto Hadoop on Azure

I’m fortunate enough to have some time and the opportunity to look into Hadoop on Azure, and I think it is really, really cool!

A side effect of something like this is almost always a bunch of random posts of notes I’m taking in the process, and I suspect this won’t be an exception; these are written mainly for my own sake, if I’m honest, but hopefully they’ll be beneficial for others too.

This one is about loading data.

Before Hadoop can analyse data, it needs data, so – how can one load a data set onto HDFS on Azure in order to run jobs on it?

The samples provided through the portal include a handy button which allows one-click deployment of the files needed to run the sample onto the cluster –

[screenshot: the sample’s one-click deployment button in the portal]

This is very useful as it takes care of all the preparation needed to run the job, which is pretty good when one just wants to see a sample running, but moving on from this – what does one do?

There are several ways to get data onto HDFS, and I bet my list is not complete, but here’s what I’ve experimented with –

To start with – looking at the description of the word count sample, for example – you can find a couple of options –

Using the fs.put() command in the interactive console

This will open up a dialog allowing you to choose a local file, specify a destination on HDFS and upload the data for you.

[screenshot: the fs.put() upload dialog in the interactive console]

The result is the specified file loaded into HDFS at the specified location (and name).
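For the record, the whole flow in the JavaScript console is just a call to fs.put() followed – if I remember the console syntax correctly – by a ‘#’-prefixed fs command to check the result; the destination path here is just my example:

js> fs.put()
    (a dialog pops up – browse to the local file and set the destination, e.g. /user/yossi/test.txt)
js> #ls /user/yossi
    (the listing should now include the uploaded file)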

Use FTPS to upload data

This requires using a tool like curl, as secure FTP is needed and the password needs to be MD5 hashed, so I’ve used the PowerShell script provided with the word count sample to upload the file securely –

#----- curl ftps to hadoop on azure powershell example ----
$serverName = "XXSERVERNAMEXX.cloudapp.net";
$userName = "XXUSERNAMEXX";
$password = "XXPASSWORDXX";
$fileToUpload = "test.txt";
$destination = "/user/yossi/test_ftps.txt";

# The FTPS endpoint expects the MD5 hash of the password rather than the password itself
$passwordHash = "";
$Md5Hasher = [System.Security.Cryptography.MD5]::Create();
$hashBytes = $Md5Hasher.ComputeHash([Char[]]$password);
foreach ($byte in $hashBytes)
    { $passwordHash += "{0:x2}" -f $byte }

# Build the curl command line (-k skips certificate validation; 2226 is the FTPS port) and run it
$curlCmd = "c:\users\yossidah\documents\curl.exe -k --ftp-create-dirs -T $fileToUpload -u $userName"
$curlCmd += ":$passwordHash ftps://$serverName" + ":2226$destination"
invoke-expression $curlCmd
#----- end curl ftps to hadoop on azure powershell example ----
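For reference, the command line the script ends up invoking is just plain curl with the hashed password inline – roughly the following, where the server name, user name and hash are placeholders:

curl.exe -k --ftp-create-dirs -T test.txt -u XXUSERNAMEXX:<md5-password-hash> ftps://XXSERVERNAMEXX.cloudapp.net:2226/user/yossi/test_ftps.txt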

It is worth noting that by default all ports on the Hadoop cluster are closed, so for this to work you have to open the FTPS port by clicking on the ‘Open Ports’ tile and toggling it open –

[screenshot: the ‘Open Ports’ tile with the FTPS port opened]

The other two options for uploading files I’ve played with are –

Using the Hadoop command line

If you can get the file onto the head node (I downloaded it from my SkyDrive account, for example), you can use the command line hadoop fs -copyFromLocal to load the file onto HDFS, but frankly this seems more trouble than it’s worth given the previous two options.
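For completeness though, this is roughly what it looks like from the Hadoop command shell on the head node – the local path and HDFS destination are just example names:

hadoop fs -mkdir /user/yossi
hadoop fs -copyFromLocal c:\data\test.txt /user/yossi/test_cli.txt
hadoop fs -ls /user/yossi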

Load data directly from Azure Storage

This is much more interesting – under the ‘Manage Cluster’ tile you can find an option to ‘Set up ASV’ or ‘Set up S3’.

This lets you configure Hadoop with credentials for the storage account on the relevant cloud platform, which lights up two options –

  1. You can now use hadoop fs -cp to copy a file from Azure Storage to HDFS, using the asv:// or s3/s3n monikers for the source file.
  2. You can actually run a job directly against the data in the cloud blob, and even write the result back to a blob, again using the relevant moniker – for example: hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar wordcount asv://foo/input asv://foo/output (both options are sketched below).
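To make those two options concrete, here’s a minimal sketch from the Hadoop command shell, reusing the asv://foo container from the example above (the container and paths are placeholders for whatever is in your storage account):

hadoop fs -cp asv://foo/input/test.txt /user/yossi/test_asv.txt
hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar wordcount asv://foo/input asv://foo/output
hadoop fs -ls asv://foo/output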

So – four nice and easy ways to get data onto Hadoop on Azure to get you started.

About Yossi Dahan
I work as a cloud solutions architect in the Azure team at Microsoft UK. I spend my days working with customers, helping them be successful in the cloud with Microsoft Azure.
