Designing for Robustness

This entry is part 6 of 8 in the series Design Principles

Focus: Robustness
Description: Robustness is not just about exception handling! System architecture and design play an important role!

In the last post, we introduced an example of an application that clearly needed to be robust. Now we need to figure out how to make design decisions that help us achieve that goal.

A naive implementation of the order import application from the last post – fetching orders from a web shop via FTP and importing them into the local warehouse system – is something like the following pseudocode:

    void ImportOrders() 
    {
        do { 
            // Connect to the web shop's FTP server.
            FTPServer webshopFTP = FTPServer.Open(webshopFTPaddress, user, password);
            foreach (RemoteFile remoteFile in webshopFTP.Files)
            {
                // Download each order file, import it into the warehouse
                // system, and then remove it from the FTP server.
                XmlDocument orderDoc = XmlDocument.Load(remoteFile.DataStream);
                ImportOrder(orderDoc);
                remoteFile.Delete();
            }
        } while (!doomsday); 
    }

Some may think that making the processing robust simply amounts to adding a try-catch inside the inner loop to catch per-file errors, and another try-catch inside the outer loop. While that would indeed make this implementation considerably more robust, it’s a bit like thinking that adding sprinkles to your vanilla ice cream makes it a five-star dessert! It’s better, but there’s certainly room for improvement!
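
For reference, that minimally hardened version might look something like the sketch below, still using the hypothetical FTPServer and RemoteFile classes from the pseudocode above (the Log helper and the Name property are likewise assumed for illustration):

    void ImportOrders()
    {
        do {
            try
            {
                FTPServer webshopFTP = FTPServer.Open(webshopFTPaddress, user, password);
                foreach (RemoteFile remoteFile in webshopFTP.Files)
                {
                    try
                    {
                        XmlDocument orderDoc = XmlDocument.Load(remoteFile.DataStream);
                        ImportOrder(orderDoc);
                        remoteFile.Delete();
                    }
                    catch (Exception ex)
                    {
                        // A single bad order file no longer stops the whole run...
                        Log.Error("Failed to import " + remoteFile.Name, ex);
                    }
                }
            }
            catch (Exception ex)
            {
                // ...and a connection failure only aborts the current pass.
                Log.Error("Order import pass failed", ex);
            }
        } while (!doomsday);
    }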

We need to consider robustness before rushing into implementation, by thinking through the various error scenarios. For each scenario, we need to weigh our options for improving robustness and pick an appropriate strategy for handling the error: retrying, falling back to an alternative approach, ignoring the error and continuing, or throwing an exception and aborting processing.
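
To make one of these strategies concrete, retrying could be wrapped in a small generic helper along the lines of the sketch below. This is only an illustration, not part of the original design; the helper name and parameters are assumed, and it relies on Func, TimeSpan and Thread.Sleep from the System and System.Threading namespaces.

    // Sketch: run an operation, retrying transient failures a limited number of times.
    T WithRetry<T>(Func<T> operation, int maxAttempts, TimeSpan delay)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Not the last attempt yet: wait a moment, then try again.
                Thread.Sleep(delay);
            }
            // On the final attempt the exception propagates to the caller,
            // which can then fall back, ignore the item, or abort processing.
        }
    }

    // Example: give the FTP connection a few chances before giving up.
    // FTPServer webshopFTP = WithRetry(
    //     () => FTPServer.Open(webshopFTPaddress, user, password),
    //     maxAttempts: 3, delay: TimeSpan.FromSeconds(10));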

Let’s start with one of the most likely sources of errors, the network connection to the FTP Server.

Unnecessary dependencies

So, your application might not be able to reach the FTP Server at all times. No amount of try-catch will help you reach it when the network is down. But what would improve robustness, then? Well, if we did not need to reach it at all times!

One important thing to realize is that the naive implementation needs both the FTP server and the warehouse system up and running at the same time in order to import anything. It may seem like they obviously must both be up, because we are in a sense moving orders from the FTP server to the warehouse system. But this is not really true, and this misunderstanding has a negative effect on reliability.

Imagine what would happen if the network connection to the FTP server was unexpectedly lost for a couple of hours. Maybe the FTP server is located within driving distance, and someone in desperation fills a USB flash drive with hundreds of orders from the web shop, drives to the warehouse, and hands the drive over to you. Unfortunately, this would be of no immediate help: the naive application cannot process orders from a removable drive, since it only knows how to fetch orders from an FTP server! Sure, if you’re lucky you might manage a rather stressful ad-hoc setup of a local FTP server and re-configure your application to use that, but there are other problems with this design.

This version of the application also always downloads and imports one order at a time. Let’s assume that downloading is rather quick, taking no more than a tenth of a second per order, while importing into the warehouse system is rather slow, let’s say around 30 seconds per order. If you were told well in advance that the warehouse’s external connection would be down between 10 and 12 for maintenance, your application would still not be able to do any processing during those two hours. Even if there were hundreds of orders waiting at the FTP server by 9:50, your application would only manage to fetch and process about 20 orders by 10:00, and then processing would stop when the application was unable to fetch another order.

Independent processes

Clearly, we can improve both of these scenarios by re-designing the application as two independent processes, separating downloading from importing. With that split, the download process could easily fetch a couple of hundred orders in less than a minute, so the order queue at the FTP server would most likely be empty when the network goes down. The importing process could then churn away at two orders per minute during the two hours of network maintenance.
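
A sketch of that split, in the same pseudocode style as before, could look like the following. The two loops would run as independent processes that only share a local folder of order files (the DownloadTo and Name members and the incomingFolder path are assumed for illustration):

    // Process 1: download order files from the FTP server to a local folder.
    void DownloadOrders()
    {
        do {
            FTPServer webshopFTP = FTPServer.Open(webshopFTPaddress, user, password);
            foreach (RemoteFile remoteFile in webshopFTP.Files)
            {
                // Quick: roughly a tenth of a second per order.
                remoteFile.DownloadTo(Path.Combine(incomingFolder, remoteFile.Name));
                remoteFile.Delete();
            }
        } while (!doomsday);
    }

    // Process 2: import order files from the local folder into the warehouse system.
    void ImportOrders()
    {
        do {
            foreach (string orderFile in Directory.GetFiles(incomingFolder, "*.xml"))
            {
                // Slow: around 30 seconds per order, but with no dependency
                // on the FTP server or the network connection.
                XmlDocument orderDoc = XmlDocument.Load(orderFile);
                ImportOrder(orderDoc);
                File.Delete(orderFile);
            }
        } while (!doomsday);
    }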

Even if we were not that lucky, and hundreds of lunch orders were placed at the web shop around 11 (while it was temporarily out of reach of your application), the “manual move” USB flash drive trick would then be simple to perform if needed – just a matter of dumping a batch of order files into the application’s processing folder.

Either way, this design is far more robust than the naive implementation, regardless of how much error handling is put into the latter! Now we are likely not even to notice shorter periods of downtime.

At an infrastructural level, you may also be able to improve robustness by installing network devices that can fall back to a secondary internet connection if the primary connection fails. Note that this is no replacement for separating downloading from importing, however, as there are still scenarios where the separation is beneficial even if the network were 100% reliable.
