The Daily Parker

Politics, Weather, Photography, and the Dog

It's not the good times they care about, it's the bad

The repercussions from Monday's data-recovery debacle continued through yesterday.

By the time business started Tuesday morning, I had restored the client's application and database to the state it had at the moment of the upgrade, and I'd entered most of their appointments, including all of them through tomorrow (Thursday). When the client started their day, everything seemed to be all right, except for one thing I also didn't know about their business: some of their customers pay them based on the appointment ID, which is nothing more than a SQL IDENTITY column in the database.

If you know how databases work, you know that IDENTITY columns are officially non-deterministic. In this specific case, the column increments by one every time it adds a row, but also in this specific case, I didn't re-enter the data in the same order it was originally entered, since I prioritized the earlier appointments.

We've gotten through the problem now, and the client no longer want to put my head on a spike, so I will now take a moment for an after-action review that might help other software developers in the future.

First, the things I did right:

  • When I deployed the upgrade Saturday, I preserved the state of the database and application at exactly that moment.
  • All of the data in the system, every field of it, was audited. It was trivially easy to produce a report of every change made to the system from roll-out Saturday afternoon through roll-back Monday night.
  • When I rolled back the upgrade Monday night, I preserved the state of the upgraded database and application at exactly that moment.
  • When the client first noticed the problem, I dropped everything else and worked out a plan with them. The plan centered around getting their business back up first, and then dealing with the technology.
  • Their customers were completely back to normal at the start of business Tuesday.
  • The application runs on Windows Azure, which made preserving the old application state not only easy, but possible.

So what should I have done better?

  • My biggest error was overconfidence in my ability to roll back the upgrade. No matter what other errors I made, this was the root of all of them.
  • The second major error was not testing the UI on Internet Explorer 8. Mitigating this was the fact that neither I nor my client was aware that the bulk of their customers used IE8. However, given that people using IE8 were totally unable to use the application, even if the numbers of customers using IE8 was very small, the large impact should have put IE8 near the top of my regression test checklist.
  • Instead of spending a couple of hours re-entering data, I should have written a script to do it.
  • I have always regretted (though never more than today) publicizing the appointments IDENTITY column to the end user, because it's normal they'd use this ID for business purposes. This illustrates the danger—not just the sloppy design—of using a single database field for two purposes. Any future version of the application will have an OrderID field that is not a database plumbing field.

All in all, the good things outweighed the bad, and I may get back in my client's good graces when I roll out the next update. You know, the one that works on IE8, but still solves the looming problem of the platform's age.

And the day started so well...

At 8:16 this morning, a long-time client sent me an email saying that one of his customers couldn't was getting a strange bug in their scheduling application. They could see everything except for the tabbed UI control they needed to use. In other words, there was a hole in the screen where the data entry should have been.

Here's how the rest of the day went around this issue. It's the kind of thing that makes me proud to be an engineer, in the same way the guys who built Galloping Gertie were proud.

It all started when I updated a Windows Azure cloud service from the no-longer-supported SDK 1.7 running on Windows Server 2008 to the current SDK (2.2) and operating system (Windows Server 2012 R2). I also upgraded the language from C# 4.0 to C# 4.5.1, which is only possible on WS2012R2.

This upgrade started months ago, and proceeded slowly because both I and the clients had other priorities. I mean, who wants to spend a lot of money upgrading a platform without upgrading the application running on it? So the last build of the application went to production in October, and I haven't touched it since. I mean, it worked fine, why mess with it? Other than the fact that the operating system and Azure SDK are no longer supported.

Before pushing the update, I thoroughly tested the application. I mean, unit tests up the ying, with a tens-of-steps-long regression test on my local, and on an Azure test instance, before even looking askance at the Production instance. When I had tested everything I could imagine, I did this:

  1. Stopped the application, to ensure no one changed any data during the upgrade.
  2. Made a full copy of the production database ("CREATE DATABASE productioncopy AS COPY OF production")
  3. Once the data was fully copied, I uploaded the new bits to the Staging slot of the application.
  4. I updated the configuration info to the current standards.
  5. VIP swap! (I swapped the staging and production instances, so the old production instance was now in the staging slot.)
  6. And....it's running just fine. All that planning and testing worked!

So what happened? Well, it turns out there's one thing I didn't anticipate: Internet Explorer 8, released five years ago Thursday, and known to have difficulties with JavaScript. Plus, the controls we used when we orignally deployed in January 2008, made by Infragistics, have known incompatibilities with IE8, but again: the application has worked just fine the whole time.

Since everything worked just fine on earlier versions of the application, and since this update didn't directly change the UI, and since IE8 hasn't been supported in quite some time, I figured there wouldn't be any problems.

It turns out that a sizable portion of my client's customers use IE8, because they're big hospitals with big IT departments and little budgets for updates.

Once I realized with abject horror that the application was simply broken for most of the people using it, I resigned myself to rolling back to the previous release, which had worked just fine. When I got home, I started this task, and the following things happened:

  1. Once again, I stopped the application.
  2. The actual database restore went fine, as did the VIP swap putting the previous version back in the Production slot and the new version in the Staging slot.
  3. When the application started up, I realized I'd forgotten to roll back the configuration information for the logging and messaging component. So the application failed to start.
  4. I rolled back the config.
  5. The application again failed to start. Only now, because the logging and messaging component is the part that's failing, I can't see any diagnostics.
  6. Fortunately, I deployed the application with Remote Desktop enabled, so I tried connecting to the virtual machine directly.
  7. The Remote Desktop user account had expired.
  8. Fortunately I use great source control. In Mercurial, I updated to the last production build before the update, and loaded it into Visual Studio.
  9. Tried to load into Visual Studio, and failed. See, I no longer have the Azure SDK v1.7. I never installed it on this machine, in fact. I'm running SDK 2.2, and I have no easy way of running an older version.

So, as far as I knew at this point, there is simply no way to get into the application, and no way for me to re-upload the old version.

I decided to try a different tack. I rolled back the rollback and restarted the new version. I also started trying to get my last remaining Windows XP machine running so that I could confirm the bug, and start testing fixes on a Test instance running Windows Server 2012 R2.

Getting a 10-year-old laptop to boot, let me log in, stop wasting time with all the detritus it acquired in its years of service, connect to my network, and open up IE8, took 45 minutes.

Some time in there I walked Parker.

So now, I can see that the error exists in IE8, and I also have found an article on how to reset the RDP password expiration date. Only, I'm really tired, and I am worried I'll make stupid errors if I keep trying to debug this right now.

So I have two approaches I will try first thing in the morning: first, roll back to the October release, and manually update the RDP expiration date so I can remote in and debug the configuration problem. Then I'll have to re-create all the data my client added yesterday, which will take me at least an hour. If I'm supremely lucky I'll have this done by 8am. Since I've had no luck at all so far on this upgrade, I am not optimistic.

Second, I'll start removing the outdated Infragistics code. Believe it or not, jQuery works fine on IE8, despite it being pretty much the latest thing in user interface languages. It's the custom crap Infragistics pushed out years ago that fails. Unfortunately I won't be able to deploy this before leaving on Thursday morning. Fortunately the application isn't going to stop working suddenly; the OS and SDK are no longer supported, but they won't actually turn the OS off until June.

And there's the irony in a nutshell. I thought I did everything right in the deployment cycle, especially the part where I got three months ahead of the due date. The things that went wrong to get me to this state of frustration and exhaustion were numerous and tiny, kind of like the things that go wrong to cause an aviation accident. That said, the client has suffered no data loss, and I preserved a whole catalog of options to fix the problem (relatively) quickly. This isn't the disaster it would have been without the deployment tools you get with Azure.

Plus, I've learned to test everything on IE8 whenever health care companies are involved. Sheesh.

Morning link round-up

If I have time, I'll read these articles today:

Now, to work.

About this blog (v 4.2)

Parker, 14 weeksI'm David Braverman, this is my blog, and Parker is my 7½-year-old mutt. I last updated this About... page in September 2011, more than 1,300 posts back, so it's time for a refresh.

The Daily Parker is about:

  • Parker, my dog, whom I adopted on 1 September 2006.
  • Politics. I'm a moderate-lefty by international standards, which makes me a radical left-winger in today's United States.
  • The weather. I've operated a weather website for more than 13 years. That site deals with raw data and objective observations. Many weather posts also touch politics, given the political implications of addressing climate change, though happily we no longer have to do so under a president beholden to the oil industry.
  • Chicago (the greatest city in North America), and sometimes London, San Francisco, and the rest of the world.
  • Photography. I took tens of thousands of photos as a kid, then drifted away from making art until early 2011 when I finally got the first digital camera I've ever had whose photos were as good as film. That got me reading more, practicing more, and throwing more photos on the blog. In my initial burst of enthusiasm I posted a photo every day. I've pulled back from that a bit—it takes about 30 minutes to prep and post one of those puppies—but I'm still shooting and still learning.

I also write a lot of software, and will occasionally post about technology as well. I work for 10th Magnitude, a startup software consultancy in Chicago, I've got more than 20 years experience writing the stuff, and I continue to own a micro-sized software company. (I have an online resume, if you're curious.) I see a lot of code, and since I often get called in to projects in crisis, I see a lot of bad code, some of which may appear here.

I strive to write about these and other things with fluency and concision. "Fast, good, cheap: pick two" applies to writing as much as to any other creative process (cf: software). I hope to find an appropriate balance between the three, as streams of consciousness and literacy have always struggled against each other since the first blog twenty years ago.

If you like what you see here, you'll probably also like Andrew Sullivan, James Fallows, Josh Marshall, and Bruce Schneier. Even if you don't like my politics, you probably agree that everyone ought to read Strunk and White, and you probably have an opinion about the Oxford comma—punctuation de rigeur in my opinion.

Thanks for reading, and I hope you continue to enjoy The Daily Parker.

Shaving the yak

I spent 4½ hours today upgrading three low-traffic websites in order to shut down an Azure database that cost me $10 per month.

The problem is this: I continually improve the Inner Drive Extensible Architecture as I learn better techniques for doing my craft. The IDEA began in 2002, and the industry changes rapidly, so every so often it changes significantly enough that things using earlier versions break when they're upgraded. About a year ago, version 2 ended and version 3 came out, breaking everything that used version 2.

Except, I still had some things out there using version 2, including its clunky data architecture. Therefore, I had to keep its clunky data architecture running on its own Azure database, at a cost of about $10 a month.

The three sites involved date from 2004, 2006, and 2007. All three moved to Microsoft Windows Azure by mid-2012, but unfortunately that means all three used the Azure SDK 1.7, which Microsoft killed somewhere around November 2012.

Upgrading from a dead version to a live version requires some effort. So for 4½ hours today, I dealt with version conflicts, expired publishing certificates, niggling little configuration errors, and a virtual machine that needed a critical upgrade. Along the way I gained 10 Stack Overflow reputation points because other people have felt my pain, but didn't know how to get past it.

This is a good example of yak shaving, and also the fundamental principle of software development: enlightened laziness.* Had a client needed me to do this work, each upgrade would have cost the client around $300 (which, being a salaried consultant, I would not have actually received). So it wasn't horribly expensive, but remember: I did this to save $10 per month.

So, from a commercial perspective, today's activities made no sense. Yet I feel completely satisfied that I solved a problem today that had bothered me for months.

* Why spend 10 minutes on a task when you can spend 4 hours automating it? By these words ye shall know software professionals.

Error installing Entity Framework 6 in a very old Web project

I remember, back in .NET prehistory (2001), that one of .NET's biggest benefits was to be the end of DLL hell. Yet I spent half an hour this afternoon trying to get a common package (Entity Framework 6) to install in a project that never had that package in the first place—because of a version conflict with .NET itself.

When I tried to install EF6, the NuGet package installer failed the installation with the message "This operation would create an incorrectly structured document". A quick check of StackOverflow suggested a couple of possible causes:

  • The Entity Framework installer creates an invalid web.config file because it gets confused about the older project's XML namespaces.
  • The EF installer chokes on .NET 4.5 and .NET 4.5.1 because it's broken.

Anyone who's spent time with Microsoft products should immediately suspect that hypothesis #2 is unlikely. No, seriously: Microsoft releases things that have bad usability, rude behavior, and incomplete features all the time, but they have some incredible QA people. This fits that pattern: the installer script works fine. It just has pretty dismal error reporting.

So after removing every trace of EF from the relevant files, and downgrading the app to .NET 4.0 from .NET 4.5.1, EF still wouldn't install. Only at this point did I start thinking about the problem.

Let's review: I had an error message about an incorrectly-structured document. The document in question was almost certainly web.config, which I could tell because the EF6 installation kept changing it. The web.config file is an XML document. XML allows you to specify a namespace. This particular XML document had a namespace defined. A Stack Overflow commenter had mentioned namespaces. Um...

At this point I changed the web.config header element from this:

<configuration xmlns="http://schemas.microsoft.com/.NetConfiguration/v2.0">

to this

<configuration>

That fixed it.

The moral of this story: read error messages carefully, form hypotheses based on the data you have available, and even before that, stop and think. And even if you're not a Microsoft developer working on NuGet package installer scripts, always give as much detail as possible in error messages, so that developers who read them can spend less time trying to understand why the operation they thought was simple took so long to accomplish.

Regular readers of this blog know how irritated I get when error messages don't actually explain the error. I'm on developers for this all the time. It's rude; it's lazy; it costs people irrecoverable time. This is one of those times.

The opposite of continuous delivery

Yesterday I wrote that I'd spend this morning setting up the Inner Drive Website as a continuous-delivery application running in Windows Azure cloud services. Well, that was a bit optimistic. Here's what I did instead:

  • Shook my head sadly that the last time I published the site at all was last March. That's a little dis-continuous, I think.
  • Upgraded the application to .NET 4.51, the Azure SDK 2.2, Azure Storage 3.0, and the latest Inner Drive Extensible Architecture build.
  • Moved the master code repository to my real provider.
  • Upgraded an administration feature to a new database schema.
  • Created an entirely new copy of the site's database to accommodate this, because other applications are using the old database schema.
  • Beat Web deploy into submission to avoid the 30-minute Azure publish time. (Why 30 minutes? Because I have a slow DSL and because the site has about 30 MB of SDK files (register!).
  • Corrected an idiotic mistake that prevented people from resetting their forgotten passwords.

The total time for this mishigos: 5 hours, including 6 minutes for the last bullet point.

So I haven't actually converted the repository to Git, let alone set up a CI server or anything like that. And now, I will go walk the dog, and work on this no more.

Continuous integration with Azure cloud services

This is just a note to myself, really. Last weekend I spent an hour setting up continuous deployment of an Azure website using Git.

At work, we're moving towards doing the same thing with Azure cloud services, which has a different set of problems to solve.

I'll have more to say about this once we've done it. Meanwhile, here are a few of the resources we're reading to get started:

This weekend I may move Inner-Drive.com to CI/CD as well. It'll be my Sunday Morning project, possibly.

Continuous deployment for Azure web sites through Git

A co-worker sent around this post from Iris Classon explaining how to set up continuous deployment in Azure. She used Visual Studio Online and Team Foundation Server. I spent about two hours this morning doing it with Visual Studio 2013 and Bitbucket. There are a couple of gotchas the way I did it:

  • First, I made a mistake, and started with the Visual Studio 2012 MVC template. There's actually a VS2013 MVC template that has better authentication features, but, well, sometimes you miss things, right?
  • Because of this, I had to add the ASP.NET Universal Providers 2.0 package from NuGet.
  • Jumping to the end, it also means I had to copy WebMatrix.Data.dll directly into the deployed Web site by FTP. It didn't get deployed through Git.
  • When I created a new Web database using the Azure portal quick-create, no amount of convincing could persuade the portal to create the database in the North Central U.S. data center just outside Chicago. Instead, it created the thing in Southeast Asia. Since the Web site was in Chicago, this was inconvenient. I wound up just adding a new database to an existing Azure server.
  • The ASP.NET universal providers required create table permissions when the application started, to create the app's membership tables. Since I had created the database on an existing server, this required me to add the application user account to the database owners role for a moment, which I don't like doing. Oh, and because it was a new database, I had to grant select, insert, update, and delete permissions to the application user account manually.

The app is pretty simple, and may not last long in its current incarnation. If you're really curious you can see it here.

The app, however, isn't the point; continuous deployment is. And I found, once I got it running, that pushing changes to the site's repository in Bitbucket updated the site transparently.

I'll play with this a little more when I have time.

The IDTIDC has left the building

Just getting a server rack out of one's apartment is only half the battle. Disposal takes a little effort, too.

Fortunately there's Craigslist. Unfortunately, people are flaky. So on the second attempt, the former Inner Drive Technology International Data Center found a new home in a small Loop family law firm.

I actually felt a little twinge. The rack, the servers, the peripherals...they're actually gone. But: Inner Drive Technology is now 100% Cloud.