The Daily Parker

Politics, Weather, Photography, and the Dog

Take the Orange Line to King's Landing

While we're getting ready to celebrate the birth of Baby X this Xmas, links are once again stacking up in my inbox. Like these:

That might be it for The Daily Parker today.

What's going on with the Microsoft Azure blog?

For the last couple of days, I've had trouble getting to Microsoft's Azure blog. From my office in downtown Chicago, clicking the link gives me an error message:

The resource you are looking for has been removed, had its name changed, or is temporarily unavailable.

However, going to the same URL from a virtual machine on Azure takes me to the blog. So what's going on here? It took a little detective work, but I think Microsoft has a configuration error in one of a set of geographically-distributed Azure web sites, they don't know about it, and there's no way to tell them.

The first step in diagnosing a problem like this is to see whether it's local. Is there something about the network I'm on that prevents me from seeing the website? That's unlikely, for a few big reasons. First, when a local network blocks or fails to connect to an outside site, usually nothing at all happens; that's how the Great Firewall of China works, because someone trying to reach a "forbidden" address may get there slowly, normally, or not at all, and it just looks like a glitch. Second, the root Azure site is completely accessible; only the Blog directory shows an error message. Finally, the error message is coming from the remote server itself. Chrome confirms this: the response is an HTTP 200 (OK) whose content is the error page I see.
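
In code terms, that last check looks something like this quick C# sketch. The /blog/ path is my guess at the URL in question, not gospel; run it from two different networks and compare what comes back.

  using System;
  using System.Net.Http;

  class CheckAzureBlog
  {
      static void Main()
      {
          // The URL path is assumed; substitute whatever page shows the error.
          const string url = "http://azure.microsoft.com/blog/";

          using (var client = new HttpClient())
          {
              HttpResponseMessage response = client.GetAsync(url).Result;
              string body = response.Content.ReadAsStringAsync().Result;

              // If the "removed or unavailable" page comes back with a success
              // status, the remote web application itself served it; it isn't a
              // local network block.
              Console.WriteLine("HTTP {0} ({1})", (int)response.StatusCode, response.StatusCode);
              Console.WriteLine(body.Substring(0, Math.Min(300, body.Length)));
          }
      }
  }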

All right, so the Azure Blog is down. But that doesn't make a lot of sense. Thousands of people read the Azure blog every day; if it were down, surely Microsoft would have noticed, right?

So for my next test, I spun up an Azure Virtual Machine (VM) and tried to connect from there. Bing! No problem. There's the blog.

Now we're onto something. So let's take a look at where my local computer thinks it's going, and where the VM thinks it's going. Here's the nslookup result for my local machine, both from my company's DNS server and from Google's 8.8.8.8 server:

Now here's what the VM sees:

Well, now, that is interesting.

From my local computer, sitting in downtown Chicago, both Google and my company's DNS servers point "azure.microsoft.com" to an Azure web site sitting in the North Central U.S. data center, right here in Chicago. But for the VM, which itself is running in the East U.S. data center in southern Virginia, both Microsoft's and Google's DNS servers point the same domain to an Azure web site also within the East U.S. data center.

It looks like both Microsoft and Google are using geographic load-balancing and some clever routing to return DNS addresses based on where the DNS request comes from. I'd bet if I spun up an Azure VM in the U.S. West data center, both would send me to the Azure blog running out there.
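
You can see the location dependence from code, too. This minimal sketch uses System.Net.Dns, which asks whatever resolver the machine is configured with, so the same three lines print a North Central U.S. address on my desktop and an East U.S. address on the VM.

  using System;
  using System.Net;

  class WhereDoesItResolve
  {
      static void Main()
      {
          // Uses the machine's configured DNS servers (my company's resolver
          // locally, Azure's resolver on the VM), which is exactly why the
          // answers differ by location.
          foreach (IPAddress address in Dns.GetHostAddresses("azure.microsoft.com"))
          {
              Console.WriteLine(address);
          }
      }
  }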

This is what massive load balancing looks like from the outside, by the way. If you've put your systems together correctly, users will go to the nearest servers for your content, and they'll never realize it.

Unfortunately, the North Central U.S. instance of the Microsoft Azure blog is down, has been down for several days, and won't come up again until someone at Microsoft realizes it's down. Also, Microsoft makes it practically impossible to notify them that something is broken. So those of us in Chicago will just have to read about Azure on our Azure VMs until someone in Redmond fixes their broken server. I hope they read my blog.

Weird routing issues at CDG

No, not aviation routing; IP routing.

From the Terminal 2 American Airlines club, I am unable to hit most *.cloudapp.net IP addresses. This is significant because it's basically all of Microsoft Azure, including logon.microsoft.net, Weather Now, and a bunch of other sites I use or have some responsibility for.

I've just spent a few minutes testing DNS (everything is fine there) and then using tracert and pathping, and it looks like the entire 168.62.0.0/16 and 168.61.0.0/16 ranges are just not visible from here. (The Daily Parker is also in Azure, but its IP is in the 191.238.0.0/16 subnet, which seems to be visible just fine.)
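
For what it's worth, here's a rough C# sketch of the kind of check I mean, minus the tracert detail. The hostname is a hypothetical placeholder, and port 443 and the five-second timeout are my own choices: it resolves the name, flags addresses in the 168.61.0.0/16 or 168.62.0.0/16 ranges, and attempts a TCP connection.

  using System;
  using System.Net;
  using System.Net.Sockets;
  using System.Threading.Tasks;

  class ReachabilityCheck
  {
      static void Main()
      {
          const string host = "myapp.cloudapp.net";  // hypothetical placeholder

          foreach (IPAddress address in Dns.GetHostAddresses(host))
          {
              byte[] octets = address.GetAddressBytes();
              bool suspectRange = octets.Length == 4 && octets[0] == 168 &&
                                  (octets[1] == 61 || octets[1] == 62);

              bool reachable = false;
              using (var client = new TcpClient(address.AddressFamily))
              {
                  try
                  {
                      // Try an actual TCP connection rather than ICMP, since ping
                      // is often blocked even when routing works.
                      Task connect = client.ConnectAsync(address, 443);
                      reachable = connect.Wait(TimeSpan.FromSeconds(5)) && client.Connected;
                  }
                  catch (AggregateException)
                  {
                      // Refused or unreachable; leave reachable = false.
                  }
                  catch (SocketException)
                  {
                      // Same: the route just isn't there.
                  }
              }

              Console.WriteLine("{0} (in 168.61/62 range: {1}) reachable: {2}",
                  address, suspectRange, reachable);
          }
      }
  }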

I wonder if Microsoft knows that its U.S. East data center is being blocked by some French ISP? Or why?

Schneier on why the NSA has made us less safe

Security expert Bruce Schneier is not an alarmist, but he is alarmed:

In addition to turning the Internet into a worldwide surveillance platform, the NSA has surreptitiously weakened the products, protocols, and standards we all use to protect ourselves. By doing so, it has destroyed the trust that underlies the Internet. We need that trust back.

By weakening security, we are weakening it against all attackers. By inserting vulnerabilities, we are making everyone vulnerable. The same vulnerabilities used by intelligence agencies to spy on each other are used by criminals to steal your passwords. It is surveillance versus security, and we all rise and fall together.

Security needs to win. The Internet is too important to the world -- and trust is too important to the Internet -- to squander it like this. We'll never get every power in the world to agree not to subvert the parts of the Internet they control, but we can stop subverting the parts we control. Most of the high-tech companies that make the Internet work are US companies, so our influence is disproportionate. And once we stop subverting, we can credibly devote our resources to detecting and preventing subversion by others.

It really is kind of stunning how much damage our intelligence services have done to the security they claim to be protecting. I don't think everyone gets it right now, but the NSA's crippling the Internet will probably be our generation's Mosaddegh.

Maybe I'll have free time later today

If so, these are queued up:

More later...

The heart bleeds for OpenSSL

Bruce Schneier, not one for hyperbole, calls the Heartbleed defect an 11 on a 10 scale:

Basically, an attacker can grab 64K of memory from a server. The attack leaves no trace, and can be done multiple times to grab a different random 64K of memory. This means that anything in memory -- SSL private keys, user keys, anything -- is vulnerable. And you have to assume that it is all compromised. All of it.

"Catastrophic" is the right word.

At this point, the odds are close to one that every target has had its private keys extracted by multiple intelligence agencies. The real question is whether or not someone deliberately inserted this bug into OpenSSL, and has had two years of unfettered access to everything. My guess is accident, but I have no proof.

It turns out, Windows systems don't use OpenSSL very much, favoring Microsoft's own TLS stack, which doesn't share this bug. So if you're visiting a Windows system (basically anything with ".aspx" at the end), you're fine.

Still, if you've used Yahoo! or any other system that has this bug, change your password. Now.

And the day started so well...

At 8:16 this morning, a long-time client sent me an email saying that one of his customers was getting a strange bug in their scheduling application. They could see everything except for the tabbed UI control they needed to use. In other words, there was a hole in the screen where the data entry should have been.

Here's how the rest of the day went around this issue. It's the kind of thing that makes me proud to be an engineer, in the same way the guys who built Galloping Gertie were proud.

It all started when I updated a Windows Azure cloud service from the no-longer-supported SDK 1.7, running on Windows Server 2008, to the current SDK (2.2) and operating system (Windows Server 2012 R2). I also upgraded the application from .NET 4.0 to .NET 4.5.1, which is only possible on WS2012R2.

This upgrade started months ago and proceeded slowly, because both the clients and I had other priorities. I mean, who wants to spend a lot of money upgrading a platform without upgrading the application running on it? So the last build of the application went to production in October, and I haven't touched it since. I mean, it worked fine; why mess with it? Other than the fact that the operating system and Azure SDK are no longer supported.

Before pushing the update, I thoroughly tested the application. I mean, unit tests up the ying-yang, with a tens-of-steps-long regression test on my local machine and on an Azure test instance, before even looking askance at the Production instance. When I had tested everything I could imagine, I did this:

  1. Stopped the application, to ensure no one changed any data during the upgrade.
  2. Made a full copy of the production database ("CREATE DATABASE productioncopy AS COPY OF production"); there's a sketch of this step after the list.
  3. Once the data was fully copied, I uploaded the new bits to the Staging slot of the application.
  4. I updated the configuration info to the current standards.
  5. VIP swap! (I swapped the staging and production instances, so the old production instance was now in the staging slot.)
  6. And....it's running just fine. All that planning and testing worked!
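
Here's that sketch of step 2, as a rough C# illustration rather than my exact script. The server name and connection string are placeholders, and the 30-second poll interval is my choice; the important part is that CREATE DATABASE ... AS COPY OF returns immediately and the copy finishes in the background, so you wait for the new database to come online before touching anything else.

  using System;
  using System.Data.SqlClient;
  using System.Threading;

  class CopyProductionDatabase
  {
      static void Main()
      {
          // Placeholder connection string; point it at the server's master database.
          const string masterConnectionString =
              "Server=tcp:myserver.database.windows.net;Database=master;" +
              "User ID=someadmin;Password=...;Encrypt=True;";

          using (var connection = new SqlConnection(masterConnectionString))
          {
              connection.Open();

              // Kicks off the copy and returns right away.
              using (var copy = new SqlCommand(
                  "CREATE DATABASE productioncopy AS COPY OF production", connection))
              {
                  copy.ExecuteNonQuery();
              }

              // Poll until the background copy comes online.
              using (var check = new SqlCommand(
                  "SELECT state_desc FROM sys.databases WHERE name = 'productioncopy'",
                  connection))
              {
                  string state;
                  do
                  {
                      Thread.Sleep(TimeSpan.FromSeconds(30));
                      state = check.ExecuteScalar() as string;
                      Console.WriteLine("productioncopy: {0}", state ?? "(not visible yet)");
                  } while (state != "ONLINE");
              }
          }
      }
  }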

So what happened? Well, it turns out there's one thing I didn't anticipate: Internet Explorer 8, released five years ago Thursday, and known to have difficulties with JavaScript. Plus, the controls we used when we originally deployed in January 2008, made by Infragistics, have known incompatibilities with IE8, but again: the application has worked just fine the whole time.

Since everything worked just fine on earlier versions of the application, and since this update didn't directly change the UI, and since IE8 hasn't been supported in quite some time, I figured there wouldn't be any problems.

It turns out that a sizable portion of my client's customers use IE8, because they're big hospitals with big IT departments and little budgets for updates.

Once I realized with abject horror that the application was simply broken for most of the people using it, I resigned myself to rolling back to the previous release, which had worked just fine. When I got home, I started this task, and the following things happened:

  1. Once again, I stopped the application.
  2. The actual database restore went fine, as did the VIP swap putting the previous version back in the Production slot and the new version in the Staging slot.
  3. When the application started up, I realized I'd forgotten to roll back the configuration information for the logging and messaging component. So the application failed to start.
  4. I rolled back the config.
  5. The application again failed to start. Only now, because the logging and messaging component is the part that's failing, I can't see any diagnostics.
  6. Fortunately, I deployed the application with Remote Desktop enabled, so I tried connecting to the virtual machine directly.
  7. The Remote Desktop user account had expired.
  8. Fortunately I use great source control. In Mercurial, I updated to the last production build before the upgrade.
  9. Then I tried to load it into Visual Studio, and failed. See, I no longer have the Azure SDK v1.7. I never installed it on this machine, in fact. I'm running SDK 2.2, and I have no easy way of running an older version.

So, as far as I knew at that point, there was simply no way to get into the application, and no way for me to re-upload the old version.

I decided to try a different tack. I rolled back the rollback and restarted the new version. I also started trying to get my last remaining Windows XP machine running so that I could confirm the bug, and start testing fixes on a Test instance running Windows Server 2012 R2.

Getting a 10-year-old laptop to boot, let me log in, stop wasting time with all the detritus it acquired in its years of service, connect to my network, and open up IE8, took 45 minutes.

Some time in there I walked Parker.

So now, I can see that the error exists in IE8, and I also have found an article on how to reset the RDP password expiration date. Only, I'm really tired, and I am worried I'll make stupid errors if I keep trying to debug this right now.

So I have two approaches I will try first thing in the morning: first, roll back to the October release, and manually update the RDP expiration date so I can remote in and debug the configuration problem. Then I'll have to re-create all the data my client added yesterday, which will take me at least an hour. If I'm supremely lucky I'll have this done by 8am. Since I've had no luck at all so far on this upgrade, I am not optimistic.

Second, I'll start removing the outdated Infragistics code. Believe it or not, jQuery works fine on IE8, despite being pretty much the latest thing in browser UI development. It's the custom crap Infragistics pushed out years ago that fails. Unfortunately I won't be able to deploy this before leaving on Thursday morning. Fortunately the application isn't going to stop working suddenly; the OS and SDK are no longer supported, but Microsoft won't actually turn the OS off until June.

And there's the irony in a nutshell. I thought I did everything right in the deployment cycle, especially the part where I got three months ahead of the due date. The things that went wrong to get me to this state of frustration and exhaustion were numerous and tiny, kind of like the things that go wrong to cause an aviation accident. That said, the client has suffered no data loss, and I preserved a whole catalog of options to fix the problem (relatively) quickly. This isn't the disaster it would have been without the deployment tools you get with Azure.

Plus, I've learned to test everything on IE8 whenever health care companies are involved. Sheesh.

About that iOS "flaw"

Security guru Bruce Schneier wonders if the iOS security flaw recently reported was deliberate:

Last October, I speculated on the best ways to go about designing and implementing a software backdoor. I suggested three characteristics of a good backdoor: low chance of discovery, high deniability if discovered, and minimal conspiracy to implement.

The critical iOS vulnerability that Apple patched last week is an excellent example. Look at the code. What caused the vulnerability is a single line of code: a second "goto fail;" statement. Since that statement isn't a conditional, it causes the whole procedure to terminate.

If the Apple auditing system is any good, they would be able to trace this errant goto line not just to the source-code check-in details, but to the specific login that made the change. And they would quickly know whether this was just an error, or a deliberate change by a bad actor. Does anyone know what's going on inside Apple?
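
For anyone who hasn't looked at the code Schneier mentions, here's a schematic of the pattern, rendered in C# rather than Apple's actual C: the duplicated goto isn't governed by the if above it, so the final check never runs and the function reports success anyway.

  using System;

  class GotoFailSketch
  {
      // Returns 0 for "verified." Because of the duplicated goto, it returns 0
      // even when the signature check should fail.
      static int VerifySignedParams(bool hashOk, bool signatureOk)
      {
          int err = 0;

          if (!hashOk)
              goto fail;
              goto fail;      // the duplicated line: runs unconditionally
          if (!signatureOk)   // unreachable; C# at least warns about this
              err = -1;

      fail:
          return err;
      }

      static void Main()
      {
          // Prints 0: a bad signature still "verifies."
          Console.WriteLine(VerifySignedParams(hashOk: true, signatureOk: false));
      }
  }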

Schneier has argued previously that the NSA's biggest mistake was dishonesty. Because we don't know what they're up to, and because they've lied so often about it, people start to believe the worst about technology flaws. This Apple error could have been a stupid programmer error, a merge conflict, or something in that category. But we no longer trust Apple to work in our best interests.

This is a sad state of affairs.