BizTalk Enterprise on Azure

With Azure IaaS and BizTalk 2013 officially released I wanted to do a quick exercise of setting up a multi-server BizTalk environment.

I started by creating a virtual network and deploying a domain controller onto it. I then added a SQL Server instance from the platform images and joined it to the domain, and finally I added a BizTalk Enterprise instance from the platform images and joined that to the domain too.

So far everything went smoothly. I then nearly remembered that in a multi-server environment I needed to manually create the necessary domain groups, and that was easy enough to do, but still – I could not get the BizTalk Configuration Wizard to work.

OK – at least partially this was due to my own arrogance, thinking I could still remember how to do this by heart despite not having touched BizTalk in anger for over two years, but partially it is simply the Configuration Wizard’s unhelpful approach to error reporting.

Either way, trawling through the log file revealed the key to finding out what was missing – the error “Failed to read "KeepDbDebugKey" from the registry”.

A quick Bing search pointed out the infamous mistake – missing the step to enable DTC appropriately.

Quite handily the BTS image comes with DTC correctly configured (no surprise there), but as I was using a vanilla SQL Server image it had DTC disabled, and that needed changing.

With that done I could properly configure BizTalk 2013 on an Azure IaaS instance and get everything to work.

Mix-n-Match on Windows Azure

One of the powerful aspects of Windows Azure is that we now have both PaaS and IaaS and that – crucially – the relationship between the two is not an ‘or’ but an ‘and’, meaning you can mix and match the two (as well as more ‘traditional’, non-cloud deployments, come to think of it) within one solution.

IaaS is very powerful, because it is an easier step to the cloud for many scenarios – if you have an existing n-tier solution, it is typically easier and faster to deploy it on Azure over Virtual Machines than it is to migrate it to Cloud Services.

PaaS, on the other hand, delivers much more value to the business, largely in what it takes away (managing VMs).

The following picture, which most have seen, I’m sure, in one shape or form, lays down things clearly –


And so – the ability to run both, within a single deployment if necessary, provides a really useful on-ramp to the cloud; Consider a typical n-tier application with a front end, middle tier and a back end database. The story could be something along these lines –

You take the application as-is and deploy it on Azure using the same tiered approach over VM roles - 


Then, when you get the chance, you spend some time and update your front end to a PaaS Web Role –


Next – you upgrade the middle tier to worker roles –


And finally – you migrate the underlying database to a SQL database –


Over this journey you have gradually increased the value you get from your cloud, on your time frames, in your terms.

To enable communication between the tiers over a private network we need to make both the IaaS elements and the PaaS elements part of the same network, here’s how you do it –

Deploying a PaaS instance on a virtual network

With Virtual Network on Azure the first step is always to define the network itself, and this can be done via the Management Portal using a wizard or by providing a configuration file –


The wizard guides you through the process, which includes providing the IP range you’d like for the network as well as setting up any subnets as required. It is also possible to point at a DNS and link this to a Private Network – a VPN to a local network. For more details see Create a Virtual Network in Windows Azure.

With the network created you can now deploy both Virtual Machines and Cloud Services to it.

Deploying Virtual Machines onto the network is very straightforward – when you run the Virtual Machine creation wizard you are asked where to deploy it to, and you can specify a region, an affinity group or a virtual network –


If you’ve selected a network, you are prompted to select which subnet(s) you’d like to deploy it to –


and you’re done – the VM will be deployed to the selected subnet in the selected network, and will be assigned the appropriate IP address.

Deploying Cloud Services to a private network was a little less obvious to me – I kept looking at the management portal for ways to supply the network to use when deploying an instance, and completely ignored the most obvious place to look – the deployment configuration file.

Turns out that a network configuration section has been added with the recent SDK (1.7), allowing one to specify the name of the network to deploy to and then, for each role, specify which subnet(s) to connect it to.

For detailed information on this refer to the NetworkConfiguration Schema, but here’s my example –

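<ServiceConfiguration serviceName="WindowsAzure1" xmlns="" osFamily="1" osVersion="*" schemaVersion="2012-05.1.7">
  <Role name="WebRole1">
    <Instances count="1" />
  </Role>
  <NetworkConfiguration>
    <VirtualNetworkSite name="mix-n-match" />
    <AddressAssignments>
      <InstanceAddress roleName="WebRole1">
        <Subnets><Subnet name="FrontEnd" /></Subnets>
      </InstanceAddress>
    </AddressAssignments>
  </NetworkConfiguration>
</ServiceConfiguration>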

This configuration instructs the platform to place the PaaS instance in the virtual network with the correct subnet, and indeed, when I remote desktop into the instance, I can confirm that it has received a private IP in the correct range for the subnet and – after disabling the firewall on my IaaS virtual machine – I can ping that virtual machine successfully using its private IP.
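A quick way to double-check that an address really falls inside the subnet you expected is to test it against the subnet's CIDR range. The sketch below uses Python's standard ipaddress module; the subnet and the addresses are made-up placeholders, not the values from my network.

```python
import ipaddress

def in_subnet(address: str, subnet_cidr: str) -> bool:
    """Return True if the address belongs to the subnet given in CIDR notation."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(subnet_cidr)

# Hypothetical values: a FrontEnd subnet of 10.4.2.0/24 and two candidate addresses
FRONT_END = "10.4.2.0/24"
print(in_subnet("10.4.2.5", FRONT_END))  # an address inside the subnet
print(in_subnet("10.4.3.5", FRONT_END))  # an address in a neighbouring range
```

The same check works for any subnet defined on the virtual network, as long as the CIDR string matches what was entered in the network wizard.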

It is important to note that the platform places no firewalls between tenants in the private network, but of course VMs may well still have their firewall turned on (the templates we provide do), and so these will have to be configured as appropriate.

And that’s it – I could easily deploy a whole bunch of machines to my little private network – some are provided as VMs, some are provided as code on Cloud Services, as appropriate, and they all play nicely together…

Deploying Joomla on Windows Azure Web Sites

Today we have released the preview of Windows Azure Web Sites to –

Quickly and easily deploy sites to a highly scalable cloud environment that allows you to start small and scale as traffic grows.

Use the languages and open source apps of your choice then deploy with FTP, Git and TFS. Easily integrate Windows Azure services like SQL Database, Caching, CDN and Storage.

Using Web Sites it is very easy to deploy many different OSS-based platforms such as Drupal, Joomla!, DNN or WordPress.

This simple post will show the steps required to get a Joomla site up and running; with Web Sites this is a wizard-driven process that takes around 10 minutes end-to-end. How easy is that?!

To start the process, in the new, HTML5 based, management portal you click the ‘NEW’ button on the bottom left


In the menu that opens you select ‘WEB SITES’ and ‘FROM GALLERY’ to use one of the pre-canned solutions


You could, of course, create your own instance and do anything you’d like on it.

In my case, from the gallery that opens, I select the Joomla! 2.5 item and click the next button


and I’m then asked to provide the details for the deployment. These will, naturally, differ from platform to platform, but they usually follow the same lines – URL, username and password, as well as the location for the deployment and database details.

For Joomla! I can select from MySQL (provided through ClearDB) or Windows Azure SQL Database; I chose the former, just because –


Next, as I opted for MySQL Database, I’m asked for the details around that


and I’m good to go!

I hit the button and I can see my web site is being created, and then deployed


a couple of minutes later, my web site is running –


and I can see the detailed view –

as well as browse to it –


and, after signing in, editing it –


Connecting to SQL Server on an Azure Virtual Machine

Not surprisingly, one of the first things I did when I got access to the new Virtual Machines capability on Windows Azure was to create a VM with SQL Server 2012. I used the gallery image and was up and running in minutes.

The next logical thing was to remote desktop to the machine and play around, which I did, and I’m glad to report it was boring – everything was exactly as I expected it to be.

Next, just for fun, I wanted to see whether I could connect to the database engine from my laptop. I knew I wouldn’t be able to use Windows Authentication, so the first thing to do was to create a SQL login on the server and make it an administrator. Standard stuff.

I was now ready to connect, so I opened Management Studio on my laptop and tried to connect to my instance (to which I had given a not-so-imaginative name), using the SQL login I created –


This sent management studio thinking for a while before coming up with the following error –

A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 – Could not open a connection to SQL Server) (Microsoft SQL Server, Error: 53)

Hmm… looks like a connectivity issue… errr… of course! Virtual Machines are created by default with only one endpoint, for RDP, so port 1433 will be blocked by the firewall on Azure.

Thankfully it is easy enough to add an endpoint for a running instance through the management portal, so I did.
Initially I created one that uses 1433 publicly and 1433 privately, but that is not a good idea as far as security is concerned. It would be much better to use a different, unexpected port publicly and map it to 1433 privately, and so I ended up using the not-so-imaginative 14333 (spot the extra 3) mapped to 1433.

This adds another layer of security (by obscurity) to my database.

With this setup I tried to connect again, using the server’s address followed by ,14333 as the server name (note the use of ‘,’ instead of ‘:’, which is what I’d initially expected) – only to get a completely different error, this time –

Login failed for user ‘yossi’. (.Net SqlClient Data Provider)

Looking at the Application event log on the server (through RDP) I could spot the real problem

Login failed for user ‘yossi’. Reason: An attempt to login using SQL authentication failed. Server is configured for Windows authentication only.

Basically – the server needed to be configured for SQL Authentication (it is configured for Windows Authentication only by default, which is the best practice, I believe).

With this done, and the service restarted, I could now connect to the database engine remotely and do as I wish.

(whether that’s a good idea, and in what scenarios that could be useful is questionable, and a topic for another post…)
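The server name format that tripped me up (host, then a comma, then the port) is easy to get wrong, so here is a minimal sketch of building it. The host name and password are hypothetical placeholders, and the connection string keys are the usual ADO.NET-style ones; 14333 is the public endpoint mapped to 1433 as described above.

```python
def sql_server_address(host: str, port: int = 1433) -> str:
    """Format a server name the way SQL Server clients expect it: host,port."""
    if not 0 < port < 65536:
        raise ValueError("port out of range: %d" % port)
    # SQL Server tools separate host and port with a comma, not a colon
    return "%s,%d" % (host, port)

def connection_string(host: str, port: int, user: str, password: str) -> str:
    """Build an ADO.NET-style connection string that uses SQL authentication."""
    return "Server=%s;User ID=%s;Password=%s;" % (
        sql_server_address(host, port), user, password)

# Hypothetical host name and password, for illustration only
print(connection_string("myserver.example.net", 14333, "yossi", "<password>"))
```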

Somebody had renamed my website!

Last week I got an email from a customer who was surprised to find out that somebody had decided to point a different domain name at their web site (i.e. somebody else’s domain name was resolving to their web site).

We couldn’t quite figure out why somebody would do that, or whether it’s really a problem, but it certainly made them feel uncomfortable, and I can see why.

Technically there’s not much one can do to prevent others from doing this, and whilst you can go and complain to the registrar of the rogue domain, this is a hassle and will take some time to sort out, so a technical solution is needed to circumvent that.

The best approach, as far as I can tell, is to set the host name property in the site bindings in IIS to the correct domain name(s), which would result in IIS rejecting any request carrying a different domain name(s) and, indeed, on-premises, this is what everybody seems to do –


Any request made to the web site using a different domain (easily simulated using the hosts file in C:\Windows\System32\drivers\etc) will result in an HTTP 400 or HTTP 503 error.

To set the host name on a web role instance declaratively one could use the hostHeader attribute of the binding element in the ServiceDefinition.csdef file – this will instruct the fabric to set the value provided in IIS and, as a result, any request made using a different host name will get rejected.
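To make the effect of host-name bindings concrete, here is a small sketch of the kind of check involved: a request is only served if its Host header matches one of the bound names. This is an illustration of the idea in plain Python, not how IIS implements it, and the domain names are hypothetical.

```python
def normalise_host(host_header: str) -> str:
    """Lower-case the Host header value and drop any :port suffix."""
    return host_header.strip().lower().split(":", 1)[0]

def is_accepted(host_header: str, bound_hosts) -> bool:
    """Mimic a site whose bindings all carry an explicit host name."""
    return normalise_host(host_header) in {h.lower() for h in bound_hosts}

# Hypothetical domain names, for illustration only
BOUND = ["www.example.com", "example.com"]
print(is_accepted("www.example.com", BOUND))        # the real domain
print(is_accepted("WWW.Example.com:80", BOUND))     # same host, different case, with port
print(is_accepted("www.rogue-domain.test", BOUND))  # somebody else's domain
```

Note that a binding with an empty host name would accept everything, which is why the default binding has to go.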

The problem with setting the host name to the production domain is that it would prevent access to the system whilst on staging, where the URL includes a generated GUID – the staging URL is not known at design time and as such cannot be provided in the ServiceDefinition.csdef file.

The solution is to set the site bindings dynamically from within the deployment, and the easiest way to do that is from the OnStart method of the Role –

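    public class WebRole : RoleEntryPoint
    {
        public override bool OnStart()
        {
            // For information on handling configuration changes
            // see the MSDN topic at
            try { FixSiteBindings(); }
            // placeholder handler: the original exception logging is not shown in this listing
            catch (Exception ex) { System.Diagnostics.Trace.TraceError(ex.ToString()); }
            return base.OnStart();
        }
    }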

Before I dive into my FixSiteBindings method I should point out that during the testing of this I’ve used the method pointed out by Christian Weyer to log any exception in OnStart to blob storage, which was very handy!

So – when the role starts, FixSiteBindings is called, which looks as follows –

        void FixSiteBindings()
        {
            //web site name is the role instance id with the "_Web" postfix (WebSite name in ServiceDefinition.csdef)
            string webSiteName = RoleEnvironment.CurrentRoleInstance.Id + "_Web";

            using (ServerManager sm = new ServerManager())
            {
                //find web site
                Site site = sm.Sites[webSiteName];
                if (site == null)
                    throw new Exception("Could not find site " + webSiteName);
                //find the binding with hostName TBR - this is the one we need to replace
                Binding b = site.Bindings.FirstOrDefault(binding => binding.Host == "TBR");
                if (b != null)
                {
                    //add a binding with the expected domain - (address:port:hostName),protocol
                    site.Bindings.Add(string.Format(@"{0}:{1}:{2}", b.EndPoint.Address, b.EndPoint.Port, RoleEnvironment.DeploymentId + ""), b.Protocol);
                    //remove the placeholder binding and commit the changes to IIS
                    site.Bindings.Remove(b);
                    sm.CommitChanges();
                }
            }
        }


In this code I use ServerManager (Microsoft.Web.Administration) to manipulate IIS settings and add the necessary bindings, but before I go into the code, let me explain the approach I’ve taken –

Ultimately, when my web site is in production, I need a host header for my domain name. I also need a host header with my staging URL, added dynamically.

As I might be using VIP swap between staging and production, I need to have both all the time because the OnStart code (or any start up tasks) will not run during a VIP swap, so I won’t get another chance to make any changes and besides – it is best to do as little as possible between staging and production to keep the system as stable as possible between environments.

Last – I need to ensure that the default binding, without the host header, does not exist, which would prevent others from pointing other domain names at my deployment.

To achieve all of the above I’ve concluded that the easiest way was to start with a ServiceDefinition.csdef file that defines the two bindings I need, use the domain name needed for production and a place holder, with a pre-determined host name for staging, in my case ‘TBR’ –

<?xml version="1.0" encoding="utf-8"?>
<ServiceDefinition name="HostName" xmlns="">
  <WebRole name="MvcWebRole1" vmsize="ExtraSmall">
    <Runtime executionContext="elevated"/>
    <Sites>
      <Site name="Web">
        <Bindings>
          <Binding name="Endpoint1" endpointName="Endpoint1" hostHeader=""/>
          <Binding name="Staging" endpointName="Endpoint1" hostHeader="TBR"/>
        </Bindings>
      </Site>
    </Sites>
    <Endpoints>
      <InputEndpoint name="Endpoint1" protocol="http" port="80" />
    </Endpoints>
    <Imports>
      <Import moduleName="Diagnostics" />
      <Import moduleName="RemoteAccess" />
      <Import moduleName="RemoteForwarder" />
    </Imports>
  </WebRole>
</ServiceDefinition>

When this gets deployed onto Azure I’ve already achieved two of my three requirements – there’s no default binding (as I’ve defined a host name for both endpoints) and I’ve got the production binding fully configured. I also have the beginning of my third requirement as I’ve got a binding for staging, and the known host name makes it easy to find it programmatically, so the last step would be to find that binding and update the host header with the correct value in the role’s OnStart method.

To achieve that I start by figuring out the name of the web site in IIS – this will be composed of the current role instance name, with the name of the web site as set in the ServiceDefinition.csdef file as a postfix (joined by an underscore) – in my case “Web”.

With an instance of the ServerManager I find the web site by name and then look for a binding with the host name ‘TBR’ – the one I need to update.

I’m ‘updating’ the binding by removing it and adding one in its place, making sure to use the values from the original one for everything but the host name, which keeps the flexibility of setting these through the ServiceDefinition file.

With the old binding removed and the new one added I commit the changes through the ServerManager and I’m done – the role should now be set correctly, allowing access to both production and staging.

One last thing worth pointing out is that for this code to run it must be run in elevated mode, otherwise trying to make any changes to IIS will result in an error due to lacking permissions; this can be achieved by adding the <Runtime executionContext="elevated"/> element in the relevant role in the ServiceDefinition.csdef file, as is shown above.

It is important to note that this only means that the RoleEntryPoint code will run in elevated mode; the rest of the role’s code will run as normal, which is quite important.

I’ve clearly taken a very specific approach to solve a very specific case.  I could have, for example, iterated over all the instance endpoints from the RoleEnvironment class and added the relevant bindings from that, which would be needed if the site had more than one endpoint; I’m sure that there are many variations for the solution above, but I hope that it provides a nice and easy solution for most and a good starting point for others.

Deploying WordPress to Azure?

I’ve been following the instructions here to deploy a WordPress site to Windows Azure.

The instructions were very detailed and clear, and the process is relatively painless, although I would recommend installing the SDKs using the Web Platform Installer because it seems that some elements are sensitive to file locations.

For example, when I installed the Azure SDK for PHP I decided, for no good reason, really, to place it under the Windows Azure SDK folder, so my path was C:\Program Files\Windows Azure SDK\Windows Azure SDK for PHP

When I reached the step to create the Azure package –

package create -in="C:\temp\WordPress" -out="C:\temp" -dev=false

I kept getting an error saying cspack.exe could not be found.

I checked and double checked the location of the file vs. the system path, and it all seemed fine, until I inspected the documentation, which suggested the code looks like this –

// Find Windows Azure SDK bin folders
$csPackFolderCandidates = array_merge(
    isset($_SERVER['ProgramFiles']) ? glob($_SERVER['ProgramFiles'] . '\Windows Azure SDK\*\bin', GLOB_NOSORT) : array(),
    isset($_SERVER['ProgramFiles(x86)']) ? glob($_SERVER['ProgramFiles(x86)'] . '\Windows Azure SDK\*\bin', GLOB_NOSORT) : array(),
    isset($_SERVER['ProgramW6432']) ? glob($_SERVER['ProgramW6432'] . '\Windows Azure SDK\*\bin', GLOB_NOSORT) : array()
);

if (count($csPackFolderCandidates) == 0) {
    throw new Microsoft_Console_Exception('Could not locate the Windows Azure SDK. Download the tools from or using the Web Platform Installer.');
}

$cspack = '"' . $csPackFolderCandidates[count($csPackFolderCandidates) - 1] . '\cspack.exe' . '"';

This suggested to me that there were some assumptions made about the possible location of cspack.exe and that somehow my ‘setup’ was not catered for and indeed – after moving the Azure SDK for PHP folder to Program Files directly the package command worked just fine.
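One plausible reading of the snippet is that it takes the last folder matching Windows Azure SDK\*\bin without checking that cspack.exe is actually in it, so an extra SDK folder nested under Windows Azure SDK can win. The sketch below recreates that situation in a temporary folder and shows a more defensive lookup. It is an illustration under that assumption rather than the SDK's code, the v1.6 folder name is illustrative, and the case-insensitive sort is only there to make the order deterministic (the PHP code uses GLOB_NOSORT, so its order depends on the file system).

```python
import glob
import os
import tempfile

def candidate_bin_folders(program_files: str):
    """All <program_files>/Windows Azure SDK/*/bin folders, sorted case-insensitively."""
    pattern = os.path.join(program_files, "Windows Azure SDK", "*", "bin")
    return sorted(glob.glob(pattern), key=str.lower)

def find_cspack(program_files: str):
    """Return the first candidate that really contains cspack.exe, or None."""
    for folder in candidate_bin_folders(program_files):
        exe = os.path.join(folder, "cspack.exe")
        if os.path.isfile(exe):
            return exe
    return None

# Recreate the layout in a temporary folder (folder names are illustrative)
root = tempfile.mkdtemp()
sdk_bin = os.path.join(root, "Windows Azure SDK", "v1.6", "bin")
php_bin = os.path.join(root, "Windows Azure SDK", "Windows Azure SDK for PHP", "bin")
os.makedirs(sdk_bin)
os.makedirs(php_bin)
open(os.path.join(sdk_bin, "cspack.exe"), "w").close()

candidates = candidate_bin_folders(root)
print(candidates[-1] == php_bin)  # the last candidate is the PHP SDK's bin folder
print(find_cspack(root) == os.path.join(sdk_bin, "cspack.exe"))
```

Checking for the executable itself, rather than trusting the position in the list, avoids depending on where the SDKs happen to be installed.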

Another thing that I’ve learnt is how important it is to make sure the database user was created correctly, with a default schema that exists and on which the user has the correct permissions – ALTER USER wordpress WITH DEFAULT_SCHEMA = dbo works for me!

Java on Azure

In my conversations with customers on Windows Azure the topic of running Java on Azure often comes up, where I would explain that Java developers can benefit from the PaaS capabilities of Azure just as much as .net developers, including the elasticity and ‘self-healing’ properties of our Public Cloud offering, whilst avoiding having to maintain VMs (or physical machines).

The best way to get an overview of the experience is to take a look at this 5 minute video as it covers all the steps required to take an existing JSP application in Eclipse and run it on Azure; if you’d rather read than listen, carry on.

Overall, the experience for Java developers is pretty close to the one for .net developers, with a couple of notable differences I will touch upon shortly, so, to start with, let me describe how I got a simple Java application to run on Azure –

I’m starting with Eclipse IDE for Java EE Developers (Indigo) and already have the Azure SDK installed on my laptop and so the only thing I need to do to prepare my Eclipse environment is to install the Azure plug-in for Eclipse, by using the Help->Install New Software Eclipse menu item and pointing it at

(you can read all about it in the Windows Azure Java Developer Center)

With the IDE ready I create, for example, a new dynamic web project using Tomcat 7 as the server, add to it a JSP page and test it locally. normal stuff. (I can just about do a hello world)

With my ‘elaborate’ web application working locally, it is time to test it in the Azure Emulator and so I create a new ‘Windows Azure Project’ which is now available after installing the plug-in; to get my code included in the Azure project I export the WAR from the dynamic web project to the approot folder in it.

Last, and this is the only real difference between .net and Java with regards to deploying on Azure, I need to provide the JDK and the server to run on the Azure role. With .net there isn’t really a concept of multiple servers; IIS is the de-facto web server, which is automatically included in all web roles, and Azure roles also come out of the box with all the .net runtime versions to date. In the Java case developers have a choice of servers they could use and could target one of several JDK versions; for this reason the Azure project needs to include the selected JDK and server packages and the script required to install them.

This is done by placing a package which includes the JDK I wish to use, and a zip file containing the server I wish to use (in my case the apache-tomcat-7.0.22 server I downloaded earlier), in the approot folder as well.

Finally – I need to provide the role with a script telling it how to install what.

Thankfully – the Azure project template for Eclipse includes sample scripts for the most commonly used servers, namely Tomcat 7, Glassfish OSE 3, JBoss AS 6 and 7 and Jetty, as well as a custom script that can be expanded, and these take care of any hassle, so all I need to do is copy the contents of the sample script file provided for Tomcat 7 to the startup.cmd file in the role and make sure all the file names are correct (WAR, JDK and server packages).

That done, my projects are now ready, and after building I can run the RunInEmulator.cmd script, also included in the Azure project template, to deploy the role to the local Compute Emulator – this tests both the script and the application and, once deployed, I am able to use my application hosted by the emulator-embedded role.

Happy with that, the last step is to prepare the application to be deployed to the cloud, this is done by changing a property on the Azure project from “Testing in Emulator” to “Deployment to Cloud” and building again – now the project contains the package to be deployed to Azure, the configuration file accompanying it and even a shortcut to the management portal.

I use these in the management portal to initiate a new deployment and several minutes later I’ve got my Apache Tomcat server running my JSP page and no VM in sight!

Of course – with the package on Azure, the platform can deploy this time and time again when scaling up or when an instance fails and needs to be re-deployed.

Result!


One final, temporary note: working on Windows 8 CTP I did find a small problem with the script provided – to avoid problems arising from long paths, the script creates a symbolic link on the root of the drive pointing at the location of the files (somewhere deep under the Eclipse workspace). Windows 8 seems very unhappy with unzipping files into a symbolic link location and the thing breaks. The temporary solution, if you are working on Windows 8, is to remove the step to CD into the symbolic link location; the default location is the approot folder anyway and everything works just fine. The nature of CTPs, I guess.

Setting up my environment to build packages to run on Hadoop on Azure

It shouldn’t have, and I have only myself to blame, but it took some time before I finally figured out what I needed to do to set up an environment on my laptop that I could use to build Map/Reduce programs in Java to run on Hadoop on Azure. Here’s the set-up I have –

I’ve downloaded Eclipse Version: 3.6.1 (Helios) and extracted it to my Program Files (x86) directory (could have been anywhere, of course).

I then downloaded the Hadoop Eclipse plug-in (hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar) and placed it in the plug-ins folder in Eclipse – I found it here

Following the excellent instructions on the YDN tutorial (despite the version mismatch) I was able to confirm that the plug-in loads fine and looks as it should. Currently, however, given the lack of proper authentication with Hadoop, hadoop-on-azure does not allow connecting to the cluster from the outside (that would introduce a risk of somebody else connecting to the cluster, as all it takes is guessing the username), which means it was not actually possible for me to connect to the cluster from the Map/Reduce locations panel, or indeed through the HDFS node in the project explorer.

It also appears that the plug-in lags behind the core development, and project templates are not up-to-date with the most recent changes to the hadoop classes, but that’s not too much of a problem as there’s not much code in the templates and this can be easily replaced/corrected.

The bit that, due to my lack of experience with Java and Eclipse, took infinitely longer than it should have is figuring out that this is not enough to build a map/reduce project….

Copying the code from the WordCount sample I kept getting errors about most of my imports until I finally figured out what should have been very obvious – I needed the hadoop-core JAR and commons-cli-1.2.jar. Both can be downloaded, although both (and others) also exist in the cluster, so I could RDP into it and use SkyDrive to transfer them over.

That was pretty much it – I could then create a new project, create a new class, paste in the code from the sample provided, export the JAR file and use it to submit a new job on hadoop-on-azure.

What took me so long?!

Next step would be to be able to test things locally, but I don’t think I’ll go there just yet…
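For a feel of what the WordCount job computes, without any Hadoop dependencies, the map and reduce steps can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the Java sample and not Hadoop's API: the function names are made up, and splitting on whitespace is an assumption about how words are tokenised.

```python
from collections import defaultdict

def map_phase(line: str):
    """Emit a (word, 1) pair for every whitespace-separated token in the line."""
    for word in line.split():
        yield word, 1

def reduce_phase(pairs):
    """Sum the counts for each word, as a reducer does per key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def word_count(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return reduce_phase(pairs)

print(word_count(["hello hadoop", "hello azure"]))
```

On a cluster the map and reduce steps run on different nodes with a shuffle in between; here they simply run one after the other.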

Not seeing your hive data after importing from the marketplace?

In my previous post I wrote about 4 ways to load data onto Hadoop on Azure.

After publishing the post I started to look into a fifth way – importing data from the Windows Azure DataMarket.

Hadoop on Azure includes the ability to provide it with credentials to the market place, a query to run and the name of a hive table to create and will do the rest – query the data through the marketplace, store it on HDFS and create a hive table on top.

To configure that – click the ‘Manage Cluster’ tile on the Hadoop on Azure homepage


and then click on the ‘DataMarket’ button


to get to this screen, in which you can provide all the details


You can get (and test) the query from the marketplace’s query builder tool –


After entering all the details and clicking the ‘import data’ button a job will get started, and when completed you will have a hive table with the dataset (you can leave this screen and check back on the job history later, naturally it is all done asynchronously).

The best way to validate that (after of course making sure the job had completed successfully through the job history screen) is to use the hive interactive console – in the Hadoop on Azure homepage click the ‘Interactive Console’ tile and be sure to click the ‘Hive’ button on the top left.

It will take a few seconds for the tables dropdown to get populated, so bear with it, but once it has you should be able to see the table name you’ve entered in the list, and if you do, you should be able to run QL queries on it through the interactive console. I (eventually, see note below) loaded data from the ThreeHourlyForecast table for Heathrow from the Met Office’s data feed, and so I could execute a query such as ‘select * from lhrmetdata’ and see the results displayed in the console.

However – with this particular data set I did bump into a bit of a glitch and what is probably a bug in this preview release – when I ran the import data job, the job info page reported the ‘Completed Successfully’ status –


…but the dropdown in the hive interactive console never showed my table, nor did running the ‘show tables’ command.

I poked around the file system on the server (by RDP-ing into it and using the web interface as well as the command line), and I could see the data feed had been downloaded successfully, so I could not figure out what had gone wrong until ‘jpposthuma’ on the HadoopOnAzureCTP Yahoo group provided spot-on advice – to check the downloader.exe log file. So I opened the MapReduce web console on the server (after RDP-ing into it) and clicked the log link at the bottom left –


The downloader.exe log file was the first listed in the directory listing


I downloaded the file and opened it in notepad (it is not viewed well in the browser), and the problem became clear immediately (I’ve highlighted the key area) –


2012-04-22 17:28:00,645 INFO  Microsoft.Hadoop.DataLoader.DataLoaderProgram: Start DataLoader …
2012-04-22 17:28:00,708 INFO  Microsoft.Hadoop.DataLoader.DataLoaderProgram: Overwriting flag [-o] is not set
2012-04-22 17:28:00,739 INFO  Microsoft.Hadoop.DataLoader.DataLoaderMediator: Begin transfer
2012-04-22 17:28:00,739 INFO  Microsoft.Hadoop.DataLoader.DataLoaderMediator: Transferring schema
2012-04-22 17:28:00,770 INFO  Microsoft.Hadoop.DataLoader.ODataSource: Begin exporting schema
2012-04-22 17:28:00,801 INFO  Microsoft.Hadoop.DataLoader.ODataSource:     build http request to data market:$top=100
2012-04-22 17:28:04,130 INFO  Microsoft.Hadoop.DataLoader.ODataSource: End exporting schema
2012-04-22 17:28:04,130 INFO  Microsoft.Hadoop.DataLoader.FtpChannel: Begin pushing schema
2012-04-22 17:28:05,708 INFO  Microsoft.Hadoop.DataLoader.FtpChannel: Ftp response code: ClosingData
2012-04-22 17:28:05,708 INFO  Microsoft.Hadoop.DataLoader.FtpChannel: End pushing schema
2012-04-22 17:28:05,708 INFO  Microsoft.Hadoop.DataLoader.DataLoaderMediator: Transferring data
2012-04-22 17:28:05,723 INFO  Microsoft.Hadoop.DataLoader.FtpChannel: Begin pushing data
2012-04-22 17:28:05,786 INFO  Microsoft.Hadoop.DataLoader.ODataSource: Begin exporting data
2012-04-22 17:28:05,786 INFO  Microsoft.Hadoop.DataLoader.ODataSource:     exporting page #0
2012-04-22 17:28:05,786 INFO  Microsoft.Hadoop.DataLoader.ODataSource:     build http request to data market:$top=100
2012-04-22 17:28:06,286 INFO  Microsoft.Hadoop.DataLoader.ODataSource: End exporting data. Total 100 rows exported
2012-04-22 17:28:06,395 INFO  Microsoft.Hadoop.DataLoader.FtpChannel: Ftp response code: ClosingData
2012-04-22 17:28:06,395 INFO  Microsoft.Hadoop.DataLoader.FtpChannel: End pushing data. Total 100 rows pushed
2012-04-22 17:28:06,395 INFO  Microsoft.Hadoop.DataLoader.DataLoaderMediator: End transfer
2012-04-22 17:28:06,411 INFO  Microsoft.Hadoop.DataLoader.DataLoaderMediator: Begin creating Hive table
2012-04-22 17:28:06,442 INFO  Microsoft.Hadoop.DataLoader.DataLoaderMediator: Begin HiveCli execution
2012-04-22 17:28:06,442 INFO  Microsoft.Hadoop.DataLoader.DataLoaderMediator:     cmd = c:\apps\dist\bin\hive.cmd
2012-04-22 17:28:06,442 INFO  Microsoft.Hadoop.DataLoader.DataLoaderMediator:     params = -v -f c:\apps\dist\logs\userlogs\hiveql\93e6a3e5-4914-4f78-8731-6bd9f2dcb94d.hql
2012-04-22 17:28:07,911 INFO  Microsoft.Hadoop.DataLoader.DataLoaderMediator:     [HiveCli stderr] Hive history file=C:\Apps\dist\logs\history/hive_job_log_yossidahan_201204221728_335367629.txt
2012-04-22 17:28:08,333 INFO  Microsoft.Hadoop.DataLoader.DataLoaderMediator:     [HiveCli stdout] CREATE EXTERNAL TABLE lhrmetdata ( ID BIGINT,ForecastSiteCode INT,PredictionId STRING,SiteName STRING,Country STRING,Continent STRING,StartTime TINYINT,Day STRING,Date STRING,TimeStep SMALLINT,SignificantWeatherId SMALLINT,ScreenTemperature SMALLINT,WindSpeed SMALLINT,WindDirection TINYINT,WindGust SMALLINT,VisibilityCode STRING,RelativeHumidity SMALLINT,ProbabilityPrecipitation SMALLINT,FeelsLikeTemperature SMALLINT,UVIndex SMALLINT,PredictionTime TINYINT ) COMMENT ‘external table to /uploads/lhrmetdata/lhrmetdata/content.dat created on 2012-04-22T17:28:06.411+00:00’ROW FORMAT DELIMITED FIELDS TERMINATED BY ’01’ LOCATION ‘/uploads/lhrmetdata/lhrmetdata’
2012-04-22 17:28:08,551 ERROR Microsoft.Hadoop.DataLoader.DataLoaderMediator:     [HiveCli stderr] FAILED: Parse Error: line 1:163 mismatched input ‘Date’ expecting Identifier near ‘,’ in column specification
2012-04-22 17:28:09,067 INFO  Microsoft.Hadoop.DataLoader.DataLoaderMediator: End HiveCli execution. Return code = 0
2012-04-22 17:28:09,067 INFO  Microsoft.Hadoop.DataLoader.DataLoaderMediator: End creating Hive table
2012-04-22 17:28:09,083 INFO  Microsoft.Hadoop.DataLoader.DataLoaderProgram: Shutdown DataLoader …


The data feed contained a field named ‘Date’, which is a reserved word in Hive, and so the parsing of the Hive command failed.

However – I now knew I had the feed data stored in HDFS already, and I knew what was wrong, so I could simply execute a slightly modified Hive create command, changing the column name from Date to TheDate. With the original command provided in the log above, this was very easy to figure out –

CREATE EXTERNAL TABLE lhrmetdata ( ID BIGINT,ForecastSiteCode INT,PredictionId STRING,SiteName STRING,Country STRING,Continent STRING,StartTime TINYINT,Day STRING,TheDate STRING,TimeStep SMALLINT,SignificantWeatherId SMALLINT,ScreenTemperature SMALLINT,WindSpeed SMALLINT,WindDirection TINYINT,WindGust SMALLINT,VisibilityCode STRING,RelativeHumidity SMALLINT,ProbabilityPrecipitation SMALLINT,FeelsLikeTemperature SMALLINT,UVIndex SMALLINT,PredictionTime TINYINT ) COMMENT 'external table to /uploads/lhrmetdata/lhrmetdata/content.dat created on 2012-04-22T17:28:06.411+00:00' ROW FORMAT DELIMITED FIELDS TERMINATED BY '01' LOCATION '/uploads/lhrmetdata/lhrmetdata'

As expected, this command completed successfully and my table now showed in the dropdown list.
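The manual rename is easy enough for one column, but it could also be sketched as a small helper. The Python below is only an illustration of the idea, not anything the data loader does: the RESERVED_WORDS set is a tiny subset I made up for the example (not Hive's full list), the function name is hypothetical, and it assumes a simple column list with no nested parentheses, which is the case for the command in the log.

```python
# Illustrative subset only (an assumption); this is NOT Hive's full reserved-word list.
RESERVED_WORDS = {"date", "table", "select"}


def rename_reserved_columns(ddl: str, prefix: str = "The") -> str:
    """Prefix any column whose name is a reserved word, e.g. Date -> TheDate.

    Assumes the first (...) in the statement is a flat column list.
    """
    open_paren = ddl.index("(")
    close_paren = ddl.index(")", open_paren)
    columns = ddl[open_paren + 1:close_paren].split(",")

    fixed = []
    for column in columns:
        # Each column definition looks like "Name TYPE"
        name, _, col_type = column.strip().partition(" ")
        if name.lower() in RESERVED_WORDS:
            name = prefix + name
        fixed.append(f"{name} {col_type}")

    return ddl[:open_paren + 1] + " " + ",".join(fixed) + " " + ddl[close_paren:]


# Shortened version of the statement from the log, for illustration
ddl = "CREATE EXTERNAL TABLE lhrmetdata ( ID BIGINT,Day STRING,Date STRING,TimeStep SMALLINT )"
fixed_ddl = rename_reserved_columns(ddl)
print(fixed_ddl)
```

Running the full statement from the log through a helper like this should give the same TheDate rename I did by hand, with everything after the column list left untouched.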

So – a valid reason for failing; it was just confusing that the job was initially reported as successful (the log above shows HiveCli ending with return code 0 despite the parse error). I’d expect this to be ironed out before Hadoop on Azure gets released, and ultimately this is a great way to work with marketplace data!
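Since the reported status cannot be relied on here, a quick way to double-check a run is to scan the downloader log for ERROR entries. The following is a minimal Python sketch under the assumption that every line follows the layout seen above (date, time, level, then the message); the function name is my own, and the two sample lines are copied from the log in this post.

```python
from typing import List


def find_error_lines(log_text: str) -> List[str]:
    """Return the log lines whose level field (third token) is ERROR."""
    errors = []
    for line in log_text.splitlines():
        parts = line.split(None, 3)  # date, time, level, rest of the message
        if len(parts) >= 3 and parts[2] == "ERROR":
            errors.append(line)
    return errors


sample_log = """\
2012-04-22 17:28:08,551 ERROR Microsoft.Hadoop.DataLoader.DataLoaderMediator:     [HiveCli stderr] FAILED: Parse Error: line 1:163 mismatched input 'Date' expecting Identifier near ',' in column specification
2012-04-22 17:28:09,067 INFO  Microsoft.Hadoop.DataLoader.DataLoaderMediator: End HiveCli execution. Return code = 0
"""

error_lines = find_error_lines(sample_log)
for line in error_lines:
    print(line)
```

Checking the level field rather than searching for the word ERROR anywhere in the line avoids false hits if a message happens to contain that word.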

Loading data onto Hadoop on Azure

I’m fortunate enough to have some time and opportunity to look into Hadoop on Azure and I think it is really, really cool!

A side effect of something like this is almost always a bunch of random posts of notes I’m taking in the process, and I suspect this won’t be an exception; these are written mainly for my own sake if I’m honest, but are hopefully beneficial for others too.

This one is about loading data.

Before Hadoop can analyse data, it needs data, so – how can one load a data set onto HDFS on Azure in order to run jobs on it?

The samples provided through the portal include a handy button which allows one-click deployment of the files needed to run the sample onto the cluster.


This is very useful as it takes care of all the preparation needed to run the job, which is pretty good when one just wants to see a sample running, but moving on from this – what does one do?

There are several ways to get data onto HDFS, and I bet my list is not complete, but here’s what I’ve experimented with –

To start with – looking at the description of the word count sample, for example – you can find a couple of options –

Using the fs.put() command in the interactive console

This will open up a dialog allowing you to choose a local file and specify a destination on HDFS, and will then upload the data for you.


The result is the specified file loaded into HDFS at the specified location (and name).

Use FTPS to upload data

This requires a tool like curl, as secure FTP is needed, and the password needs to be MD5 hashed, so I’ve used the PowerShell script provided with the word count sample to upload the file securely –

# Connection details: fill in the server name and your own credentials
$serverName = "";
$userName = "XXUSERNAMEXX";
$password = "XXPASSWORDXX";
$fileToUpload = "test.txt";
$destination = "/user/yossi/test_ftps.txt";
# The password is sent as the lowercase hex MD5 hash of its characters
$passwordHash = "";
$Md5Hasher = [System.Security.Cryptography.MD5]::Create();
$hashBytes = $Md5Hasher.ComputeHash($([Char[]]$password))
foreach ($byte in $hashBytes)
           { $passwordHash += "{0:x2}" -f $byte }
# Build the curl command: -k skips certificate validation, -T uploads the given file
$curlCmd = "c:\users\yossidah\documents\curl.exe -k --ftp-create-dirs -T $fileToUpload -u $userName"
$curlCmd += ":$passwordHash ftps://$serverName" + ":2226$destination"
invoke-expression $curlCmd
#----- end curl ftps to hadoop on azure powershell example ----
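
To make the hashing step a bit more explicit, here is a rough Python equivalent of what the script does before calling curl. It is only a sketch: the function names are mine, "example.invalid" is a placeholder host rather than a real cluster address, and I am assuming an ASCII-only password (which is where the character-to-byte cast in the PowerShell version behaves the same way). Nothing is executed; the code only builds the argument list.

```python
import hashlib


def md5_hex(password: str) -> str:
    """Return the MD5 digest as lowercase hex, like the "{0:x2}" loop in the script."""
    return hashlib.md5(password.encode("ascii")).hexdigest()


def build_curl_args(server, user, password, local_file, destination, port=2226):
    """Build the curl argument list used by the script (nothing is run here)."""
    return [
        "curl", "-k", "--ftp-create-dirs",
        "-T", local_file,                      # file to upload
        "-u", f"{user}:{md5_hex(password)}",   # user name and hashed password
        f"ftps://{server}:{port}{destination}",
    ]


# Placeholder values only; the host name is hypothetical
args = build_curl_args("example.invalid", "XXUSERNAMEXX", "XXPASSWORDXX",
                       "test.txt", "/user/yossi/test_ftps.txt")
print(args)
```

Passing the arguments as a list (for example to subprocess.run) would also avoid the quoting issues that come with building one long command string.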

It is worth noting that by default all ports on the Hadoop cluster are closed, so for this to work you first have to open the FTPS port from the ‘Open Ports’ tile.


The other two options for uploading files I’ve played with are –

Using the Hadoop command line

If you can get the file onto the head node (I’ve downloaded it from my SkyDrive account, for example), you can use the command line hadoop fs -copyFromLocal to load the file onto HDFS, but frankly this seems more trouble than it’s worth given the previous two options.

Load data directly from Azure Storage

This is much more interesting – under the ‘Manage Cluster’ tile you can find an option to ‘Set up ASV’ or ‘Set up S3’

This lets you configure Hadoop with credentials to the storage account in the relevant cloud platform and this lights up two options –

  1. You can now use hadoop fs -cp to copy a file from Azure Storage to HDFS using the ASV:// or s3/s3n monikers for the source file.
  2. You can actually run a job directly with the data on the cloud blob and even write the result back to a blob, again – using the relevant moniker, for example – hadoop.cmd jar hadoop-examples- wordcount asv://foo/input asv://foo/output
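
As a small illustration of the first option, the copy command can be assembled like this. This is a sketch only: the helper names are hypothetical, the file paths are made up, and I am assuming that the first segment after asv:// names the storage container, based on the asv://foo/input form used above.

```python
def asv_uri(container: str, path: str) -> str:
    """Build an asv:// URI (assumes the first segment is the container name)."""
    return f"asv://{container}/{path.lstrip('/')}"


def build_copy_command(source_uri: str, hdfs_destination: str) -> list:
    """Build the 'hadoop fs -cp' argument list; nothing is executed here."""
    return ["hadoop", "fs", "-cp", source_uri, hdfs_destination]


# Example paths are placeholders, not files from this post
command = build_copy_command(asv_uri("foo", "input/data.txt"), "/user/yossi/data.txt")
print(" ".join(command))
```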

So – four nice and easy ways to get data onto Hadoop on Azure and get started.
