What the Exxon Valdez and database clusters have in common
I was recently asked to comment on the proposed design for a project by a prospective new customer. The project involved a high number of simultaneous users, each contributing small amounts of data, and was to be hosted in the Cloud. The exact details were To Be Decided, but Amazon EC2 and MySQL were floated as likely candidates for the hosting and RDBMS components. (Although my ultimate recommendations would have at least considered using SQL Azure instead, given the time constraints and the other technologies involved, which would have dovetailed into the wider solution.)
The discussion got me thinking about database clustering as it relates to performance and scalability. While talking the project through with the client's Technical Director, it transpired that the organisation had previously used clustering in an attempt to improve performance, and that the approach had failed.
That didn't surprise me. It reflects a misunderstanding I've witnessed a number of times, whereby people misjudge the benefit that database clustering actually bestows. In short, people often believe that such a design aids scalability and performance. Unfortunately, this isn't the case. What the architecture actually provides is increased reliability, not performance: if one database goes down, another is in place to take over quickly and keep processing transactions until the failed server can be brought back online. If anything, a cluster is less performant than a standalone database, since every write has to be replicated out to the duplicate databases.
The analogy I usually give people when discussing the benefits and limitations of clustering is that it's a bit like the debate about double hulls on oil tankers. As you may know, after the Exxon Valdez disaster the US Government brought in legislation requiring every new oil tanker built for use in US ports to be constructed with a double hull. The aim was admirable enough: to prevent such an ecological disaster from ever happening again. However, it was also a political knee-jerk reaction of the worst kind. Well-intentioned, but not based on measurable facts.
Of perhaps most relevance to the topic is the small fact that the parts of the Exxon Valdez that were punctured were in fact double-hulled (the ship was holed on its underside, and it had a double hull on that surface). Added to this, a double-hull design makes ships less stable, so they're that little bit more likely to collide with obstacles that more manoeuvrable designs could avoid. And, just as with database clustering, the added complexity actually reduces capacity: in the case of ships, the inner hull is smaller; in databases, the extra replication required means fewer transactions can be processed in the same amount of time with the same processing power.
I've been involved in a couple of projects that featured either a large amount of data in a single transaction hitting a replicated database, or large numbers of smaller individual transactions conducted by simultaneous users. In neither case was clustering a good solution to the design challenges faced.
The first project I have as a point of reference was one I worked on back in 2007: a business intelligence application that collected around a million items of data a month from a user base of 400 or so. I was the lead developer on that seven-person team, and so had complete control over the design chosen. I also had the advantage of having at my disposal one of the finest technical teams I've ever worked with.
The system comprised a SQL Server database used by around 30 back-office staff, OLAP cubes built overnight for BI analysis, and certain sub-sections of the schema replicated out to users who accessed the system via PDAs over GPRS (a connection which, of course, would be 3G or 4G today). The PDA users represented the bulk of those 400 users of the system.
The design we settled upon traded off normalisation and database size in return for the least impact on those parts of the schema that needed to be replicated out to the PDAs. So, CRUD updates made in the back-office system were only transferred to near-identical, read-only tables used by the PDAs once an hour (an interval that could be tuned during actual use, either to aid performance or to speed up the propagation of information as required). This approach meant that the affected tables had fewer sequential CRUD operations to carry out whenever the remote users synced over their low-bandwidth connections. And if they were out of range of connectivity altogether, their devices still worked using on-board, read-only copies of the back-office data required.
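To make the idea concrete, here's a minimal sketch of the kind of hourly refresh involved. The table and column names are purely illustrative rather than taken from the actual project, and I've used MERGE (SQL Server 2008 onwards) for brevity; on earlier versions the same effect needs separate UPDATE, INSERT and DELETE statements.

-- Hypothetical sketch: refresh a read-only PDA table from its back-office
-- source, applying only the net differences. All names are illustrative.
CREATE PROCEDURE dbo.RefreshPdaCustomerTable
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

    MERGE dbo.Customer_PDA AS target
    USING dbo.Customer AS source
        ON target.CustomerId = source.CustomerId
    WHEN MATCHED AND (target.Name <> source.Name
                   OR target.Postcode <> source.Postcode) THEN
        -- Only rows that have genuinely changed are touched.
        UPDATE SET Name = source.Name, Postcode = source.Postcode
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerId, Name, Postcode)
        VALUES (source.CustomerId, source.Name, source.Postcode)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;

    COMMIT TRANSACTION;
END;
GO

A SQL Server Agent job (or any similar scheduler) would then call the procedure once an hour, with the interval adjustable whenever faster propagation or lighter load was needed.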
The second main consideration in the design was a large data import task that happened once every six weeks. One of my developers produced a solution that was algorithmically sound, but that quickly hit the limits of what an ORM-driven approach can do. In short, it took several hours to run, grinding through thousands of individual DELETE, INSERT and UPDATE statements. And if any consistency errors were found in the data to be imported (not an uncommon occurrence), the whole process had to be repeated, again and again, until it eventually ran without hiccups. It could easily take a skilled DBA 24 hours to cleanse the data and complete the import successfully. Meanwhile, the performance of those replicated parts of the schema used by the PDAs would be taking a battering. A better approach was needed.
In the end, I opted to use SQL Server's XML data type to pass the bulk upload data into a stored procedure as a single transaction. Inside the procedure, wrapped in a reversible TRANSACTION, only those parts of the data that represented actual changes were applied. (For example, it wasn't unusual for the imported data to contain a DELETE instruction followed by an INSERT of exactly the same data; the stored proc was smart enough to spot that and only make the changes that affected the net state of the system.) I designed the stored proc so that any error would cause the whole process to be rolled back, with the specific nature of the error reported via the UI. The improved process ran in under a second and no longer required the supervision of a DBA. Quite a difference from 24 hours.
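What follows is a heavily simplified sketch of that pattern rather than the procedure itself: the XML shape, table and column names are invented, but it shows the essentials of shredding the XML once, applying only the net changes, and rolling the whole lot back on any error.

-- Hypothetical sketch of a single-transaction XML import. All names and the
-- XML structure are invented for illustration.
CREATE PROCEDURE dbo.ImportProducts
    @payload XML
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRY
        BEGIN TRANSACTION;

        -- Shred the XML into a relational shape once, rather than issuing
        -- thousands of individual statements from application code.
        DECLARE @incoming TABLE
        (
            ProductId INT PRIMARY KEY,
            Name      NVARCHAR(100),
            Price     DECIMAL(10, 2)
        );

        INSERT INTO @incoming (ProductId, Name, Price)
        SELECT  x.value('@id',    'INT'),
                x.value('@name',  'NVARCHAR(100)'),
                x.value('@price', 'DECIMAL(10, 2)')
        FROM    @payload.nodes('/import/product') AS t(x);

        -- Only touch rows whose net state actually changes: a DELETE followed
        -- by an INSERT of identical data results in no write at all.
        UPDATE  p
        SET     p.Name = i.Name, p.Price = i.Price
        FROM    dbo.Product p
        JOIN    @incoming i ON i.ProductId = p.ProductId
        WHERE   p.Name <> i.Name OR p.Price <> i.Price;

        INSERT INTO dbo.Product (ProductId, Name, Price)
        SELECT  i.ProductId, i.Name, i.Price
        FROM    @incoming i
        WHERE   NOT EXISTS (SELECT 1 FROM dbo.Product p
                            WHERE p.ProductId = i.ProductId);

        DELETE  p
        FROM    dbo.Product p
        WHERE   NOT EXISTS (SELECT 1 FROM @incoming i
                            WHERE i.ProductId = p.ProductId);

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        -- Any consistency error undoes the entire import and surfaces the
        -- specific problem to the caller for display in the UI.
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

        DECLARE @msg NVARCHAR(2048);
        SET @msg = ERROR_MESSAGE();
        RAISERROR(@msg, 16, 1);
    END CATCH
END;
GO

From the application's point of view, the entire import then becomes one call with a single XML parameter, rather than thousands of round trips to the database.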
The second project that informs my views on clustered database designs was one where I wasn't the design authority; in this case, I was just using the database(s) for some other purpose. Prior to my involvement, a SQL Server cluster of three database instances had been set up and kept in sync. The solution was designed for a vendor of tickets for all sorts of events, including popular rock concerts. It wasn't unusual for tickets to go on sale and for an allocation of many thousands to sell out in literally ten seconds flat, as lots of fans (and I'm sure ticket touts too) sat feverishly pressing F5, waiting for the frenzy to start. (And sometimes, if the concert organiser got their price point wrong, you'd find that only a few tickets were sold for an over-priced event, but that's another story!)
In the case of this design, I never did see the failover capabilities come into play; all three SQL Server instances replicating the same data for reliability reasons stayed up all of the time. I had a feeling, though, that if one ever went down under load, it wouldn't have been long before the others suffered the same fate. And since it was an on-premises deployment rather than a cloud-based one, something like a power cut would have stopped the show dead.
It's not that common for hardware to fail just because a high number of requests arrive simultaneously. All that happens is that some users won't get through (and you, as the site owner, will never know it happened). It's not as though the server shuts down in shock. Even the recent low-tech attacks on large online retailers like Amazon, using amateur tools such as LOIC, didn't damage any critical infrastructure. At best, such conditions saturate traffic for a short while; often they don't achieve even that much.
As a final point, I'd note that there are far greater concerns when designing an authenticated, public-facing system, such as CSRF vulnerabilities. Any attempt to address performance by clustering will inevitably work against those security concerns, because the commonly accepted mitigations (anti-forgery tokens, for instance) typically rely on data being reliably saved and retrieved within short time frames, rather than eventually getting in sync, as most clustering solutions allow for.
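To illustrate the point, here's a minimal, hypothetical sketch of a server-side anti-forgery token store of the kind the synchroniser-token pattern implies; none of the names come from a real system. The token written during one request has to be readable a moment later on the next, which asynchronous replication can't guarantee.

-- Hypothetical anti-forgery token store; all names are invented.
CREATE TABLE dbo.AntiForgeryToken
(
    Token    UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
    UserName NVARCHAR(256)    NOT NULL,
    IssuedAt DATETIME         NOT NULL DEFAULT (GETUTCDATE())
);
GO

CREATE PROCEDURE dbo.IssueToken
    @userName NVARCHAR(256),
    @token    UNIQUEIDENTIFIER OUTPUT
AS
BEGIN
    SET @token = NEWID();
    INSERT INTO dbo.AntiForgeryToken (Token, UserName) VALUES (@token, @userName);
END;
GO

CREATE PROCEDURE dbo.ValidateAndConsumeToken
    @userName NVARCHAR(256),
    @token    UNIQUEIDENTIFIER
AS
BEGIN
    -- The token issued on the previous request must be readable here, only
    -- moments later. If this read were served by a replica still waiting for
    -- the write to propagate, a perfectly valid request would be rejected.
    DELETE FROM dbo.AntiForgeryToken
    WHERE  Token = @token AND UserName = @userName;

    IF @@ROWCOUNT = 0
        RAISERROR('Invalid or expired anti-forgery token.', 16, 1);
END;
GO

That immediacy is precisely what an eventually-consistent cluster cannot promise.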