The Daily Parker

Politics, Weather, Photography, and the Dog

Almost done with Weather Now's upgrade

I've just finished—I mean, finished—the Weather Now worker role. The worker role runs in the background and performs tasks like downloading the weather from outside sources, parsing it, and storing it.

I have three tasks left to enable me to publish the new version of Weather Now to its new home in Windows Azure:

  • Create a script to initialize the lists that appear on the site's home page;
  • Upgrade the existing ASP.NET website to an ASP.NET web application; and
  • Create an Azure Cloud Service Web role to house the application.

I believe I will be done today sometime. But first, Parker is demanding a trip outside.

March? What do you mean, March?

I'm just a day from losing my mind (or "loosing," to all you Facebookers out there), a day from my workload returning to normal levels, and a day from deploying Weather Now to a test instance in Azure. Then, maybe, I'll have time to take it all in.

Watch this space for a sneak preview of Weather Now 4.0, possibly tomorrow. The GetWeather utility has run with only minor hitches for a week, and with two more (quick) bug fixes it's ready for production. That just leaves about 6 hours of work to move the ASP.NET application up to Azure...and then, you get to beta test it. If all goes well I'll cut over to Azure on the 9th or 10th, and finally—finally!—retire my last two servers.

The very bad week of Microsoft Windows Azure

Microsoft has suffered some unfortunate outages this week, first affecting SQL databases on Monday and then Storage yesterday:

On Friday, February 22 at 12:44 PM PST, Storage experienced a worldwide outage impacting HTTPS traffic due to an expired SSL certificate. This did not impact HTTP traffic. We have executed repair steps to update SSL certificate on the impacted clusters and have recovered to over 99% availability across all sub-regions. We will continue monitoring the health of the Storage service and SSL traffic for the next 24 hrs. Customers may experience intermittent failures during this period. We apologize for any inconvenience this causes our customers.

The outage caused problems throughout the Azure universe, because SSL-based storage underpins just about everything. Without Storage, for example, any VM that goes offline can't restart, because its VHD is kept in Storage. Web sites and Service Bus were also hosed. My customers were annoyed.

These problems can affect any computing system. The problem with Azure Storage going down was the scope of it: millions of applications. Even the largest colo data center only has tens of thousands of computers. With so many people affected, the outage looks like a disaster.

I'll be watching Microsoft closely over the next few days to see what more they can tell us about the outage. But if this was all due to certificates expiring, wow.

Two Microsoft Azure outages in 24 hours

Over the past two days, Microsoft Azure had two outages they're still investigating. The first ran from 18:26 to 20:00 CST Monday (0026 to 0200 UTC Tuesday) and the second from 13:50 to 15:27 CST (1950 to 2127 UTC) yesterday; both affected SQL Database and related services in the Azure datacenter outside Washington, D.C.

I noticed the Monday evening outage as it happened, because when a database goes down, a number of applications start sending me emails. A couple of people had minor inconveniences, but as it happened on a holiday evening, the damage wasn't too severe.

I did not notice the Tuesday afternoon outage, which did affect a lot of people and made some of my clients very angry, because I was on an airplane. When I landed and turned on my phone, I had 300 emails from various applications and mercifully only 4 from angry clients. (Welcome home!)

Microsoft hasn't reported the cause yet, but given the maintenance they had planned, started, and then backed out on Sunday night, they may have a clue. They have a second round of maintenance planned for tonight at midnight CST (0600 UTC). I'll be watching carefully tomorrow morning.

When the Azure emulator is more forgiving than real life

Last night I made the mistake of testing a deployment to Azure right before going to bed. Everything had worked beautifully in development, I'd fixed all the bugs, and I had a virgin Windows Azure affinity group complete with a pre-populated test database ready for the Weather Now worker role's first trip up to the Big Time.

I should have predicted the worker role's first complete and total failure. Just as I do in the brick-and-mortar development world, I create low-privilege SQL accounts for applications to use. So immediately I had a bunch of SQL exceptions, which I resolved with a few GRANT EXEC commands. No big deal.

Once I restarted the worker role, it connected to the database, loaded its settings, downloaded a file from NOAA and...crashed:

Inner Drive Weather threw System.Data.Services.Client.DataServiceRequestException
...
OutOfRangeInput

One of the request inputs is out of range.
RequestId:572bcfee-9e0b-4a02-9163-1c6163798d60
Time:2013-02-10T06:05:41.5664525Z

at System.Data.Services.Client.DataServiceContext.SaveResult.d__1e.MoveNext()

Oh no. The dreaded Azure Storage exception that tells you absolutely nothing.

Flash forward fifteen minutes (now past midnight; for context, I'm writing this on the 9am flight to Los Angeles): with Fiddler running on a local instance connected to production Azure storage, I found the XML block that real Azure Storage barfed on but that the Azure storage emulator passed without a second thought. The offending table entity is metadata that the NOAA downloader worker task stores to let the weather parsing worker task know it has work to do:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<entry xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
       xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
       xmlns="http://www.w3.org/2005/Atom">
  <title />
  <author>
    <name />
  </author>
  <updated>2013-02-10T05:55:49.3316301Z</updated>
  <id />
  <content type="application/xml">
    <m:properties>
      <d:BlobName>20130209-0535-sn.0034.txt</d:BlobName>
      <d:FileName>sn.0034.txt</d:FileName>
      <d:FileTime m:type="Edm.DateTime">2013-02-09T05:35:00Z</d:FileTime>
      <d:IsParsed m:type="Edm.Boolean">false</d:IsParsed>
      <d:ParseTime m:type="Edm.DateTime">0001-01-01T00:00:00</d:ParseTime>
      <d:PartitionKey>201302</d:PartitionKey>
      <d:RetrieveTime m:type="Edm.DateTime">2013-02-10T05:55:29.1084794Z</d:RetrieveTime>
      <d:RowKey>20130209-0535-41d536ff-2e70-4564-84bd-7559a0a71d4d</d:RowKey>
      <d:Size m:type="Edm.Int32">68202</d:Size>
      <d:Timestamp m:type="Edm.DateTime">0001-01-01T00:00:00</d:Timestamp>
    </m:properties>
  </content>
</entry>

Notice that the ParseTime and Timestamp values are equal to System.DateTimeOffset.MinValue, which, it turns out, is not a legal Azure table value. Wow, would it have helped me if the emulator horked on those values during development.

The fix was simply to make sure that neither System.DateTimeOffset.MinValue nor System.DateTime.MinValue ever got into an outbound table entity, which took me about five minutes to implement. Also, it turned out that even though my table entity inherited from TableServiceEntity, I still had to set the Timestamp property when using real Azure storage. (The emulator sets it for you.)
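
The guard itself is trivial. Here's a sketch of the kind of helper I mean (the class and method names are mine, and the 1601-01-01 floor is my reading of the table service documentation, so treat the exact value as an assumption):

using System;

public static class TableDateTime
{
    // Azure table storage rejects DateTime values earlier than about
    // 1601-01-01 UTC, so DateTime.MinValue and DateTimeOffset.MinValue
    // (year 0001) are out of range.
    private static readonly DateTime Floor =
        new DateTime(1601, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Clamp a value so it never falls below the table service minimum.
    public static DateTime Sanitize(DateTime value)
    {
        return value < Floor ? Floor : value;
    }
}

Running ParseTime (and any other date property on an outbound entity) through a check like that, plus setting Timestamp explicitly before the save, covers both of the gotchas above.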

By this point it was 12:30, though, and I needed to get some sleep. So my plan to run an overnight test will have to wait until this evening at my hotel. Then I'll find the other bits of code that work fine against the emulator but fail against real Azure storage because, for reasons that pass understanding, the emulator gets them completely wrong.

Under the hood of Weather Now

My most recent post mentioned finishing the GetWeather component of Weather Now, my demo project that provides near-real-time aviation weather for most of the world. I thought some readers might be interested to know how it works.

The GetWeather component has three principal tasks:

  • Downloading raw weather data from its sources (right now, NOAA);
  • Parsing the downloaded data; and
  • Storing the parsed weather for the application to use.

In the Inner Drive Technology world, an Azure worker process uses an arbitrary collection of objects that implement the IWorkerTask interface. The interface defines Interval and LastRun properties and an Execute method, which is all the worker process needs to know. The tasks are responsible for their own lifespans, reentry prevention, etc. (That's another discussion.)
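
For readers who like seeing the shape of the thing, here's a minimal sketch of that contract. The member names come from the description above, but the exact types and the scheduling loop are my assumptions, not the actual Inner Drive code:

using System;
using System.Collections.Generic;

public interface IWorkerTask
{
    // How often the task wants to run.
    TimeSpan Interval { get; }

    // When the task last ran (UTC).
    DateTimeOffset LastRun { get; }

    // Do the work; the task handles its own re-entry prevention.
    void Execute();
}

public static class WorkerLoop
{
    // Roughly what the worker role does with its collection of tasks.
    public static void RunPending(IEnumerable<IWorkerTask> tasks)
    {
        foreach (IWorkerTask task in tasks)
        {
            if (DateTimeOffset.UtcNow - task.LastRun >= task.Interval)
            {
                task.Execute();
            }
        }
    }
}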

In order to decouple the data source (NOAA now, other sources in the future) from the application, I split the three tasks into two IWorkerTask classes (see the sketch after this list):

  • The NoaaFileDownloadingWorkerTask opens an FTP connection to the NOAA public weather servers, retrieves the files it hasn't already retrieved, and stores the contents in Azure Blob Storage; and
  • The NoaaFileParsingWorkerTask pulls the files out of Azure Storage, parses them, and stores the results in an Azure SQL Database and Azure table storage.
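
Skeletons of the two classes, building on the IWorkerTask sketch above, look something like this. These are illustrative only; the real classes handle configuration, logging, and retries that I've reduced to comments here:

using System;

public class NoaaFileDownloadingWorkerTask : IWorkerTask
{
    public TimeSpan Interval { get; set; }
    public DateTimeOffset LastRun { get; private set; }

    public void Execute()
    {
        // 1. Open an FTP connection to the NOAA public weather servers.
        // 2. Compare the server's file list against the metadata already
        //    stored to find files not yet retrieved.
        // 3. Store each new file's contents in Azure Blob Storage.
        // 4. Write a metadata entity (blob name, file time, IsParsed = false)
        //    so the parsing task knows it has work to do.
        LastRun = DateTimeOffset.UtcNow;
    }
}

public class NoaaFileParsingWorkerTask : IWorkerTask
{
    public TimeSpan Interval { get; set; }
    public DateTimeOffset LastRun { get; private set; }

    public void Execute()
    {
        // 1. Query the metadata table for entities with IsParsed == false.
        // 2. Pull each referenced file out of Blob Storage and parse it.
        // 3. Store the parsed observations in SQL Database and table storage.
        // 4. Mark the metadata entity as parsed (IsParsed = true, ParseTime set).
        LastRun = DateTimeOffset.UtcNow;
    }
}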

I'm using Azure storage as an intermediary between the two sides of the process because my analysis led me to conclude that they're really independent of each other. Coupling the two tasks in the current (2002) version of GetWeather causes all kinds of problems, not least that a failure in one task can stop the whole thing. If, as happens given the nature of the Internet, the FTP side has an unrecoverable problem, the application has to restart. In actual practice it simply kills itself and waits for the next time it runs, which can be a while because it runs as a Windows Server 2008 Scheduler job every 30 minutes.

The new architecture will allow the parser to run every minute or two, see if it has anything to do by looking at some metadata, and do its job if needed. I can change a system setting to stop it from running (for example, because I need to do some database maintenance), while letting the downloader continue to work separately.
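
The "see if it has anything to do" check is just a query against that metadata. Here's a sketch, assuming the 1.x storage client library that the stack trace above suggests (the entity class is hypothetical, mirroring the properties in the XML block earlier, and the exact API differs between SDK versions):

using System;
using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Hypothetical entity mirroring the metadata properties shown above;
// PartitionKey, RowKey, and Timestamp come from the base class.
public class NoaaFileMetadata : TableServiceEntity
{
    public string BlobName { get; set; }
    public string FileName { get; set; }
    public DateTime FileTime { get; set; }
    public bool IsParsed { get; set; }
    public DateTime ParseTime { get; set; }
    public DateTime RetrieveTime { get; set; }
    public int Size { get; set; }
}

public static class ParserCheck
{
    // Returns true if the downloader has written metadata the parser
    // hasn't processed yet. A real implementation would also handle
    // continuation tokens and restrict the query to recent partitions.
    public static bool HasWork(CloudStorageAccount account, string tableName)
    {
        var context = account.CreateCloudTableClient().GetDataServiceContext();

        var pending = context.CreateQuery<NoaaFileMetadata>(tableName)
            .Where(m => m.IsParsed == false)
            .ToList();

        return pending.Count > 0;
    }
}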

On the other side, the downloader can run every 5 minutes, snatch the one or two files it needs from NOAA, and shut down cleanly without waiting for the parser. NOAA likes this because the connection is only open for a few seconds, instead of the 27 minutes it stays open right now. And if the NOAA server isn't available, so what? It's a clean shutdown and a clean start a few minutes later.

This design also allows me to do something else: manually upload files for parsing and storage. This helps with testing, migration, service interruptions—all things that the current architecture has made nearly impossible.

I'm not entirely done with the application (and while writing this I just thought of an improvement I'll need to make to prevent infinite retries), but it's close. And I'm really pleased with the application so far. Stay tuned; I can now set a tentative public launch date of March 31st.

Azure table partition schemes

I'm sitting at my remote office working on a conundrum: how to balance human usability against good software design.

The problem is this: how can I create an Azure table partitioning scheme that uses Azure efficiently and still allows the user (me) to troubleshoot problems with the feature in question efficiently? This is a direct consequence of the issues I worked on this morning.

The feature is the component of the Weather Now parsing system that stores raw weather data from NOAA temporarily. By "temporarily" I mean until I delete it. Keeping the raw data will allow me to figure out why problems occur and will let the application apply new features to old data in the future.

NOAA publishes "cycle files" about every 3-6 minutes. The cycle uses a predictable sequence of 750 file names that repeats about every 4 days. The files go from file000 to file750, then back to file000. Sometimes, however, NOAA restarts the sequence at 0, skips files, or just crashes entirely, so the feature has to handle the file names as random. That said, the files have definite publication times, and generally—to an extent that Weather Now can optimize itself based on the pattern—the files contain weather data gathered within a short time before NOAA publishes the files.

You can have practically unlimited Azure tables in a storage account; I would imagine the number is close to the Int32 maximum value of 2.1 billion. Each table can have billions of partition keys as well. Searching on a combination of Azure table name and partition key takes the same length of time no matter how many tables are in the storage account or how many partition keys each table has. Under the hood, Azure manages the indexing so efficiently that network latency will be the bigger problem in all but a few edge cases.

For Weather Now, my first thought was to create a new table for each month's NOAA files and partition the table by day. So the weather parsing process would put the metadata for a file downloaded right now in the table "noaa201301" and use the partition key "20130127". That would give me about 5,700 rows in each table and about 190 rows in each partition.

I'm reconsidering. Given it's taken 11 years to change the way that Weather Now retrieves and stores weather data, using that scheme would give me 132 tables and 4,017 partitions, each of them kind of small. Azure wouldn't care, but it would over time clutter up the application's storage account. (The account will be cluttered enough as it is, with the millions of individual weather reports tabled by station and partitioned by month.)

On reflection, then, I'm going to create a new table of metadata each year, and partition by month. An Azure table with 69,000 rows (the number of NOAA files produced each year) isn't appreciably less efficient than one with 69 rows or 69 million, as it turns out. It will still partition the data as efficiently as the partition key suggests. But cutting the partitions down 30-fold could make a big difference in efficiency.
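
In code, the whole decision boils down to two format strings. A sketch (the helper names are mine, not anything in the application):

using System;

public static class NoaaMetadataKeys
{
    // Chosen scheme: one table per year, one partition per month.
    // A file retrieved on 2013-01-27 lands in table "noaa2013",
    // partition "201301".
    public static string TableName(DateTime retrieved)
    {
        return "noaa" + retrieved.ToString("yyyy");
    }

    public static string PartitionKey(DateTime retrieved)
    {
        return retrieved.ToString("yyyyMM");
    }

    // First thought (rejected): table per month ("noaa201301") and
    // partition per day ("20130127"), which works out to 132 tables
    // and about 4,017 partitions over the data collected so far.
}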

I'm open to contrary evidence. In fact, I'd love to find some. But given the frequency of data reads (one every 5 minutes or so), and the thousands of tables already in the application's storage account, I think this is the best way to go.

Why this last Azure move is taking so long

The Inner Drive Technology International Data Center continues to whir away (and use electricity), despite my best efforts to shut it down by moving everything to Microsoft Windows Azure.

Most of the delay finishing the move has nothing to do with its technology. Simply, my real job has taken a lot of time this month as we've worked toward launching a new application tomorrow. Against the 145 hours spent on that project this month, not counting the 38 hours spent helping with other projects, squeezing out the 22 hours I've managed to find for Weather Now has left me falling behind on the Oscar nominees.

For those just joining our story, Weather Now remains the last living application in the IDTIDC. This application shows real-time aviation weather for almost every airport in the world. I wrote the first version in 1998, moved it to its own domain in 2000, and published the last significant update in 2010.

The application benefited for most of its life by having practically unlimited hardware and system software to run on. As a Microsoft partner, I've gotten access to Windows Server, SQL Server, and other goodies for my entire professional life. Moving to Azure changes the calculus radically.

Weather Now runs on Microsoft SQL Server 2012 Enterprise with essentially limitless disk space. In the past 14 years, the application has quietly gathered 50 GB of data, merrily occupying a physical partition scheme that takes up a good bit of a RAID-5 volume. Creating a similar architecture in Azure exceeds my budget a bit: a single medium VM to run the application and its GetWeather component plus a 50 GB SQL Database would cost about $250 per month.

Fortunately, I don't have to do that. Most of the data, you see, hardly ever gets used.

Weather Now usually has around 4,500 current weather observations and 165,000 observations from the last 24 hours. Since each row is small, and the index is positively tiny (only the station ID and observation times are indexed), the current table uses about 1.5 MB of space and the last-24 table uses 43 MB. That doesn't even break a sweat on Azure SQL Database.

No, it's the Archive table that grows like the Beast from Below. That one has all of the past observations since the site started. In some cases I've pruned the table, but basically, it has one row per observation per station. For an average station like O'Hare, that means about 10,500 per year. For a chatty, automated station that spits out a report every 20 minutes, it stores about 27,000 per year. For 2012 alone, that works out to about 47 million* rows—growing at 4 million per month.
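
(The arithmetic, roughly: a station reporting every 20 minutes files 3 × 24 × 365 ≈ 26,280 observations a year, or about 27,000; an hourly station files 24 × 365 ≈ 8,760 routine observations, with the difference up to 10,500 presumably coming from extra reports between the hourly ones.)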

What to do? Well, stop storing it was my first thought. It hardly ever gets used, partially because the UI doesn't have a way to pull out historical data.

On the other hand, I've frequently wanted to illustrate blog entries with specific weather reports that have permanent links. And this problem, such as it is, does not have a difficult solution.

So, among its other features, Weather Now 4.0 will store archival weather reports in Azure table storage. It won't have the full 50 GB of material initially, possibly ever; but even if it did, it would only cost about $5 per month to store it. And I've hit on a partitioning scheme that will, eventually, make finding archival data really quick and easy, no matter how much of it there is.
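
To make that concrete, a scheme along the lines of the "tabled by station and partitioned by month" layout mentioned earlier might map to keys like these. The helper and the row-key format are my illustration, not necessarily what Weather Now 4.0 will ship with:

using System;

public static class ArchiveKeys
{
    // One table per reporting station, one partition per month, with the
    // observation time as the row key so a single report can be addressed
    // directly, which is exactly what a permanent link needs.
    public static string TableName(string stationId)
    {
        // e.g. "KORD" becomes "archiveKORD" (table names must be alphanumeric).
        return "archive" + stationId;
    }

    public static string PartitionKey(DateTime observed)
    {
        return observed.ToString("yyyyMM");
    }

    public static string RowKey(DateTime observed)
    {
        return observed.ToString("yyyyMMddHHmm");
    }
}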

The conclusion should be obvious: If you start looking at things the Azure way, using Azure can save you tons of money. My current estimate of the monthly cost to run Weather Now, assuming current visitor levels and acceptable performance on "very small" Cloud Services instances, is $40 per month. If it eventually amasses 50 GB of archives, it will cost...$42 per month. And if I get thousands of visitors that require upgrading to a "small" instance, I'll start selling subscriptions, but I won't have to buy new equipment because it's Azure.

More on this later. Right now, I've got to get back to work.

* Actually, 47,704,735 rows for 2012, an average of 130,341 new rows per day.