Focus: Robustness
Description: Make sure the system is able to handle and recover from failures
Robustness is a very important property of a system, especially for unattended and/or long-running operations. For each module, component, or function you build, you should be able to estimate how robust it needs to be and which things are most likely to go wrong. You then have to make sure those things do not compromise your system’s overall functionality more severely than the error actually calls for.
There’s no reason your system shouldn’t be able to gracefully handle at least all of the common errors, most of the anticipated errors, and even a lot of the unanticipated errors that can, and will, occur at run-time. A small, commonly occurring error should not be able to cause a lot of trouble!
An Example
Let’s say you are employed by the FruutXpress company, which is in the business of selling fresh fruit with express delivery to customers in your city via its web shop. Customer orders need to be shipped pronto, and there’s a warehouse full of fruit and 30 workers to pick the incoming orders into boxes and load them onto the delivery trucks.
Your task is to implement a program that imports the orders from the new FruutXpress web shop into the company warehouse system. You are told that the incoming orders are available as XML files, one order per file, on an FTP server for your import program to read. You will use the warehouse system’s rather simple API, which lets you query and import data.
On the surface, a rather simple task, right? I mean, how hard can it be? You just read a couple of lines from a text file containing how many of which fruits have been ordered, a delivery address, and not much more, and then import this into the warehouse system (which can then print out picking lists, shipping labels, and whatever else the warehouse personnel need to do their work).
However, when you think about it some more, you realize that this function is absolutely critical for the FruutXpress company. If your program stopped working, customers would still place orders at the web shop, but the orders would never reach the warehouse system, and the delivery trucks and workers would sit idle because they wouldn’t get any new orders to pick. Or rather, they would be running around screaming about probably having to work overtime to fulfill all the pending deliveries, and you’d have warehouse management on the phone within minutes, complaining!
Clearly you must design and build a really robust import program! It’s important that it does its job well, and only truly catastrophic errors, such as the network being down or a server crash, should be able to force it to abort processing. It should be able to keep processing orders, even in the face of many different error conditions.
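To make the idea concrete, here is a minimal Python sketch of an import loop that sets a bad order aside instead of letting it stop the whole run. Everything specific in it is an assumption made for illustration: the XML element and attribute names (order, address, item, fruit, quantity), the FatalImportError class, and the import_into_warehouse callback standing in for the warehouse API are all hypothetical, since the real order format and API are not described here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


class FatalImportError(Exception):
    """Hypothetical: raised when processing cannot continue (e.g. network down)."""


@dataclass
class ImportReport:
    imported: list = field(default_factory=list)   # names of imported files
    rejected: dict = field(default_factory=dict)   # file name -> reason


def parse_order(xml_text):
    """Parse one order. The element names are invented for this sketch."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    address = (root.findtext("address") or "").strip()
    if not address:
        raise ValueError("missing delivery address")
    items = []
    for item in root.findall("item"):
        fruit = item.get("fruit")
        quantity = int(item.get("quantity", ""))  # ValueError if not a number
        if not fruit or quantity <= 0:
            raise ValueError("invalid order line")
        items.append((fruit, quantity))
    if not items:
        raise ValueError("order contains no items")
    return {"address": address, "items": items}


def import_orders(order_files, import_into_warehouse):
    """Import each order; one bad order must not stop the others.

    order_files maps a file name to its XML text. import_into_warehouse is a
    hypothetical stand-in for the warehouse system's API.
    """
    report = ImportReport()
    for name, xml_text in order_files.items():
        try:
            order = parse_order(xml_text)
            import_into_warehouse(order)
        except FatalImportError:
            raise  # catastrophic: let the whole run abort
        except Exception as exc:
            # Anything else is treated as a problem with this one order only:
            # record it for a human to look at and move on to the next file.
            report.rejected[name] = f"{type(exc).__name__}: {exc}"
        else:
            report.imported.append(name)
    return report
```

Calling import_orders with three files, one of them malformed, would import the two good orders and list the bad one under rejected with the reason. Catching Exception this broadly is a deliberate choice for an unattended job; it is only reasonable because every rejection is recorded rather than silently swallowed.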
More about this in the next post.