Recurse.se: Yada yada on Software Development

22 Oct 2011

Principles of Software Design

This entry is part 1 of 8 in the series Design Principles

Software should be well designed, right? What, then, is well-designed software? I'd say it's much easier to spot badly designed software than to postulate what constitutes well-designed software. Nevertheless, I'm going to give it a try. I'll also try to provide illustrative examples and justifications for my arguments.

However, I think we all can agree that the answer to what constitutes good design depends heavily on the context and purpose the software is developed for.

Are we tasked with building a service without a user interface, a web or desktop application, a game or an operating system, a smartphone app, or an embedded system? Will it be stand-alone or part of a larger system? Will it be single-user or multi-user, a commercial product or an in-house solution? Will it run unattended? Is it life-critical? Will it handle 100 KB or 100 TB of data per day? What is the cost of failures? How much data loss is acceptable? How much downtime is acceptable? What are the performance goals?

Design decisions that would be considered "good" (or "good enough") for one kind of system, might be detrimental to another.

Target software

So let's define what target software I'm considering in this discussion of "good design".

I'm most familiar with designing and reviewing software at what we can call the "medium to large" scale, where systems are composed of several interacting components, such as desktop or web applications, web and Windows services, and relational databases, loosely connected to form a cohesive whole.

My experience is that most systems have very few, if any, constraints that are truly hard and must never be broken. In practice this means that nothing is life-critical, and the cost of failure is not overwhelming. A small amount of downtime may also be acceptable to the users. We are probably not dealing with data sets of hundreds of terabytes, and are not handling extremely sensitive or secret information.

The principles I want to discuss may not be equally valid for very large-scale systems such as MMORPGs, or for the very small hobbyist scale such as the Sudoku solver you write for fun. They will, in my opinion, still be applicable to a rather wide range of enterprise or business software, from single-user desktop applications, to web sites with tens of thousands of users, to business-critical information processing services running as Windows services or internet-facing web services.

The principles

Now let's get closer to the actual design discussions, shall we?

Experience has taught me a few things about software design, and I think there are a number of desirable properties that we'd like our systems to have. The following lists some of the most important ones:

  • Functional - able to provide the required functionality.
  • Testable - so we can ensure the system does what we want it to do.
  • Robust - able to handle and recover from failures.
  • Monitorable - the status of the system can be observed from the outside while it is running.
  • Deployable - easy to move from the development environment to the production environment.
  • Scalable - able to gracefully handle a growing amount of data and users.
  • Adaptable - able to handle changes in the requirements or run-time environment.
  • Efficient - able to make the most out of the available computing resources.
  • Elegant - attains perfection not when there's no more to add, but when there's no more to remove.

My plan is to expand on the characteristics of these individual design properties in a series of future posts. I could probably expand on this list, but these are the really important ones, and we have to start somewhere, don't we? If I come up with more desirable properties, I'll have to come back to them later, or I'll never get started. This is a blog, not a book, after all.

Feedback from users or by observation is invaluable for the evaluation of a design (be it your own, or any design you know and understand well). Negative feedback on the design is a powerful motivator to design something better next time. Positive feedback helps reinforce the good decisions made, and we are likely to try to re-use good ideas the next time around.

23 Oct 2011

A Question of Scale

This entry is part 2 of 8 in the series Design Principles

In the previous post we decided that there are a number of desirable properties that we'd like a software system to have. I'll quickly repeat the ones I mentioned: a system should be Functional, Testable, Robust, Monitorable, Deployable, Scalable, Adaptable, Efficient and Elegant.

(There's probably ample opportunity to create a catchy and memorable abbreviation out of these words, but let's not.)

These properties can be achieved at different scales, or levels. By that I mean that they are affected by architectural (large scale) and design (medium scale) decisions as well as implementation (small scale) decisions.

So, in this series of articles, I am using the word "design" in a very broad sense, not something limited to the designer role. These design principles are about how to create software not by just hacking away at the problem at hand, but by having a plan at a higher level, describing how the system should be built to get a professional result by design (i.e. on purpose, not by accident). This plan should then affect everyone on the development team.

I could have called this series "friendly advice on how to increase the level of professionalism in your software development effort", but chose to use the word "design" with this same general meaning.

Role playing

First I'd like to elaborate a little on the differences and interconnections between the roles of architect, designer and implementer, as I will not place much focus on these individual roles later in this series.

I consider these roles to be more or less a matter of the scale at which one operates. The architect considers issues at the largest scale, how the whole will operate as a function of the major components, while the implementers (craftsmen) consider issues at the smallest scale, performing the practical construction of the needed parts according to the designs.

The designer and architect roles are really only needed because sufficiently large systems have far too many "moving parts" for any one person to understand the entire system in any detail. It then becomes more efficient to have some people focus on the general higher-level goals, and delegate the lower-level details of how to reach specific goals to others focusing on their part of the system.

How are these things connected?

The smaller the system, the less need for designer and architect skills on the team. If you are building a dog house, you don't need to be an architect. However, as soon as the system complexity increases and it becomes composed of several cooperating services or applications, those skills become important. It's hard to rescue bad architecture with good design, or a terrible design with good implementation. (On the other hand, it's simple to wreck even the best design with a lousy implementation; anyone can do that!)

Analogies are often made between software construction and building construction. I guess that's because building construction is easy to visualize, and because it shares roles and questions of scale with software construction.

(Image: The Sydney Opera House under construction.)

So, let's for a moment consider the Sydney Opera House. Surely an example of grand architecture! But is it also an example of grand design, and grand implementation?

I have no idea, but my point here is that at smaller scales, it could be a real mess! It does not follow that anything is superbly implemented at lower levels of detail, even though we can agree it is grand in the larger scales.

For example, the electrical wiring could be installed in an amateurish and unsafe manner, and placed where there is no easy access for maintenance workers.

The white ceramic tiles (manufactured in Sweden, by the way) could have been unprofessionally set on the roof in a way that lets water in underneath, letting rust and mold start to damage the building.

So the architect and designer(s) could have done a good job, but the craftsmen could have been amateurs, causing a whole lot of problems that may last for the entire lifetime of the building.

And even if the architecture makes it a fantastic landmark, it could be that it is not optimally functional as an opera house. The Wikipedia entry indeed suggests that there are problems with the acoustics in the two major halls. But apparently the customer wanted both a landmark and an opera house, and so compromises were made.

This leads us to my next post in this series, where I'll get to the first, and perhaps most important, property of any system: being functional!

2 Nov 2011

A Functional System

This entry is part 3 of 8 in the series Design Principles

Focus: Functionality
Description: Make sure the system meets the stated requirements!

Surprisingly, when discussing design, it's easy to forget perhaps the most important design criterion of all, namely that the system should be able to provide the required functionality! I actually didn't include this in my first draft of important design criteria.

Perhaps it is thought that the necessary functionality is expressed solely in the requirements specification, and then realized by the implementation, and that the design is in some way orthogonal to this. However, a sloppy system design can ruin even the most well-specified requirements. If nothing else, it might set some unfortunate decisions in stone (well, code), and make other requirements hard or impossible to meet.

Examples

That quick and dirty hack you did a couple of months ago – you know, the one where you remove obviously invalid customer email addresses on the incoming orders from the new web shop (because those caused issues at your end, and the consultants had far more pressing issues than improving email validation rules).

You know the hack doesn't catch all possible errors, but it clearly improved the situation, and it has been running fine since day one. Even if that hack has none of the other desirable properties we are discussing here, it is functional and provides value, or you would've removed it from production.

And that new web shop: if it isn't functional - if it does not even provide the basic shopping functionality that your company wanted it to have - then it's in effect worthless! This holds even if it has a whole host of other properties (including some less desirable ones, such as costing too much).

Obviously, a more complex system can provide value, even though it does not do everything exactly the way the customer wanted. It may have bugs, or a huge misunderstanding that makes one module clearly less useful than intended, but it can still be functional as a whole.

How to achieve functionality by design

The most important tools for achieving functionality are:

  • A good requirements specification. (Hah! as if that’ll ever happen!)
  • Good communication with the stakeholders (the customer, the users etc.)
  • A good understanding of the problem domain. (Domain knowledge.)
  • General development experience. (The more, the merrier!)
  • Several releases of the system. (Since you will never get it right the first time!)

It is important to note that a requirements specification, or even communication with the customer, will never in itself give the whole picture. At least, I’ve never been on such a project, and I’m pretty sure I never will. This is because the customer won’t be able to get every tiny detail down on paper, and won’t even be able to tell you everything during meetings.

The more complex the system, the less likely you are to get the whole picture up front before starting development. In all likelihood, no one (including the customer) will even understand the full system completely at that point, so understandably they cannot give you all the details.

It is also important to realize that it is very common for quite important requirements not to be stated explicitly in the requirements specification! There is a whole category of requirements that the customer often leaves implicit and unstated, but that we should nevertheless understand if we want to build a great system. More on this in the next post!

5 Nov 2011

Non-functional requirements

This entry is part 4 of 8 in the series Design Principles

Focus: Functionality
Description: Make sure the system meets even the unstated requirements!

If you are dealing with a customer with little experience of specifying software requirements, they are likely to include only functional requirements in the specification, and to leave out most or all of the non-functional requirements. I'm not talking about things that don't work; I'm referring to requirements that do not directly specify functionality the system should provide, but instead specify how the system should be, describing a quality of the system. These requirements, even though often unstated, are often crucial to the eventual success of the system.

Most of our other system properties are non-functional in nature, and thus I'm arguing that capturing these non-functional requirements is the key to actually providing the customer with a high-quality system, making it functional by design.

If we're building a web-shop, there's likely a requirement to send each order to the warehouse for picking. This is an entirely functional requirement, but there could be multiple unstated non-functional requirements hidden within it.

  • It does not say what delays would be acceptable (from the customer placing the order to it being available for picking in the warehouse) - a non-functional requirement related to performance.
  • It also might not say anything about the importance of not crashing or stalling upon trying to import an incomplete order (say one without an OK delivery address, or for obsolete items) - a non-functional requirement related to robustness.

This web-shop system can comply completely with the functional requirements, but still infuriate the customer because it doesn't fulfill the (unstated) non-functional ones. Consider if it hangs when receiving a single incomplete order, blocking further order processing and needing manual intervention to restart, or if orders from the web-shop are imported into the warehouse system at a rate far slower than customers are placing them during peak hours, causing unwanted delays in picking.

Obviously, it would be best if all requirements were stated instead of unstated, but in my experience customers seldom include non-functional requirements. Of course, that doesn't stop them from being unhappy when the system does not meet the unstated requirements ("but it's obvious this shouldn't take this long!").

Don’t underestimate the power of the stuffing

Possibly the customer hasn't even fully considered all these details when you are starting to build the system. Since not all details are known in advance, you (or your team) will have to fill the gaps with some "stuffing" when you get down to that level of detail. Very often this stuffing corresponds closely to the unstated non-functional requirements.

The quality of the stuffing you and your team fill the system with will be directly proportional to your understanding of the problem (and problem domain), and also to your development experience. Yes, you can ask questions, but if you ask too many, you'll start getting contradictory answers, or annoy a customer who was hoping you would be able to take sensible action on your own.

Domain knowledge and experience will tell you when the customer says one thing but really means another. It will help you decide to make configurable those things that are likely to change (even if the customer didn’t explicitly tell you so). It will help you make the right decisions, and guide your development towards a system that is not merely implementing each requirement, but is functional by design, and where our other system properties have been considered, and added in an amount appropriate for the system you are building.

The importance of experience and domain knowledge should not be underestimated! A team of developers lacking either experience or domain knowledge is likely to get several things wrong at first. A team lacking both is more or less guaranteed to get most things wrong, and is unlikely to ever get a moderately complex system to a point where it's mostly working. This is because the inexperienced team hasn't yet learned to consider the unstated non-functional requirements, but hopes to achieve full functionality by more or less independently working through the requirements to the letter, one by one.

This reminds me of an assignment in programming class at university, where we were implementing a text editor, and had a requirement to warn the user of unsaved changes if he tried to exit. Some students then simply let the editor exit with a message like “Warning, you had unsaved changes” printed on the console.

13 Nov 2011

A Robust System

This entry is part 5 of 8 in the series Design Principles

Focus: Robustness
Description: Make sure the system is able to handle and recover from failures

Robustness is a very important property of a system, especially for unattended and/or long-running operations. For each module, component, or function you are building, you should be able to estimate the need for robustness, and the things that are most likely to go wrong. You then have to make sure those things do not compromise your system's overall functionality in a manner more severe than the error actually calls for.

There’s no reason your system shouldn’t be able to gracefully handle at least all of the common errors, most of the anticipated errors, and even a lot of the unanticipated errors that can, and will, occur at run-time. A small, commonly occurring error should not be able to cause a lot of trouble!

An Example

Let's say you are employed by the FruutXpress company, which is in the business of selling fresh fruit with express delivery to customers in your city via their web shop. The customer orders need to be shipped pronto, and there's a warehouse full of fruit and 30 workers to pick the incoming customer orders into boxes and onto the delivery trucks.

Your task is to implement a program that imports the orders from the new FruutXpress web shop into the company warehouse system. We are told that the incoming orders are available as XML files, one order per file, on an FTP server for your import program to read. You will use the warehouse system's rather simple API, which lets you query and import data.

On the surface, a rather simple task, right? I mean, how hard can it be? Just read a couple of lines of some text file containing how many of which fruits have been ordered, a delivery address, and not much more, and then import this into the warehouse system (which can then print picking lists, shipping labels, and whatever else the warehouse personnel need to do their work).

However, when thinking about it some more, you realize it's a rather important function that is absolutely critical to the FruutXpress company. If your program stopped working, the customers would still place orders at the web shop, but the orders would never reach the warehouse system, and the delivery trucks and workers would sit idle, because they wouldn't get any new orders to pick. Or rather, they would be running around screaming about probably having to work overtime to fulfill all pending deliveries, and you'd have warehouse management on the phone complaining within minutes!

Clearly you must design and build a really robust import program! It’s important that it does its job well, and only rather catastrophic errors, such as the network being down, or a server crash, should be able to force it to abort processing. It should be able to keep processing orders, even in the face of many different error conditions.

More about this in the next post.

21 Nov 2011

Designing for Robustness

This entry is part 6 of 8 in the series Design Principles

Focus: Robustness
Description: Robustness is not just about exception handling! System architecture and design play an important role!

In the last post, we introduced an example of an application that clearly needed to be robust. Now we need to figure out how to make design decisions that help us achieve that goal.

A naive implementation of the order import application from the last post - fetching orders from a web shop via FTP and importing them into the local warehouse system - looks something like the following pseudocode:

    void ImportOrders()
    {
        do {
            // connect and process every order file currently on the server
            FTPServer webshopFTP = FTPServer.Open(webshopFTPaddress, user, password);
            foreach (RemoteFile remoteFile in webshopFTP.Files)
            {
                // fetch the order, push it into the warehouse system,
                // and only then remove it from the FTP server
                XmlDocument orderDoc = XmlDocument.Load(remoteFile.DataStream);
                ImportOrder(orderDoc);
                remoteFile.Delete();
            }
        } while (!doomsday);
    }

Some may think that making the processing robust simply amounts to adding a try-catch inside the inner loop to catch per-file errors, and another try-catch inside the outer loop. While I agree that this would indeed make the implementation considerably more robust, it's a bit like thinking that adding sprinkles to your vanilla ice cream makes it a five-star dessert! It's better, but there's certainly room for improvement!

We need to consider robustness before rushing into implementation. We need to consider the various error scenarios. For each scenario we need to consider our options to improve robustness in the face of errors, and find an appropriate strategy to handle the error, such as retrying, falling back to an alternative strategy, ignoring the error and continuing, or throwing an exception and aborting processing.
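
As a minimal sketch of the first of those strategies (the helper and its parameters are my own invention, not part of the example application), a retry wrapper could look like this:

    using System;
    using System.Threading;

    static class Retry
    {
        // Run an action, retrying a few times with a fixed delay between
        // attempts; rethrow the last exception once the attempts run out.
        public static void WithRetries(Action action, int maxAttempts, int delayMilliseconds)
        {
            for (int attempt = 1; ; attempt++)
            {
                try { action(); return; }
                catch (Exception)
                {
                    if (attempt >= maxAttempts) throw;
                    Thread.Sleep(delayMilliseconds);
                }
            }
        }
    }

For instance, the pseudocode's remoteFile.Delete() call could be wrapped as Retry.WithRetries(() => remoteFile.Delete(), 3, 5000); a transient error would then be ridden out, while a permanent error would still surface after the last attempt.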

Let's start with one of the most likely sources of errors: the network connection to the FTP server.

Unnecessary dependencies

So, your application might not be able to reach the FTP server at all times. No amount of try-catch will help you reach it when the network is down. But what would improve robustness, then? Well, if we did not need to reach it at all times!

One important thing to realize is that the naive implementation needs both the FTP server and the warehouse system up and running at the same time in order to import anything. It may seem like they obviously must both be up, because we are in a sense moving orders from the FTP server to the warehouse system. But this is not really true, and this misunderstanding has a negative effect on reliability.

Imagine what would happen if, for some reason, the network connection to the FTP server was unexpectedly lost for a couple of hours. Maybe the FTP server is located within driving distance, and someone in desperation fills a USB flash drive with hundreds of orders from the web shop and drives to the warehouse, handing over the drive to you. Unfortunately, this would be of no immediate help, as the naive application would not be able to process these orders from the removable drive; it only knows how to fetch orders from an FTP server! Sure, if you're lucky you might be able to do a rather stressful ad-hoc setup of a local FTP server and re-configure your application to use that, but there are other problems with this design.

This version of the application is also always downloading and importing one order at a time. Let's assume that downloading is rather quick, taking no more than a tenth of a second per order, but that importing into the warehouse system is rather slow, say around 30 seconds per order. If you were told well in advance that the warehouse's external connection would be down between 10 and 12 for maintenance, your application would still not be able to do any processing between 10 and 12. Even if there were hundreds of orders waiting at the FTP server by 9:50, your application would only manage to fetch and process about 20 orders until 10:00, and then processing would stop when the application was unable to fetch another order.

Independent processes

Clearly, we can improve both of these scenarios by re-designing the application as two independent processes, separating downloading from importing. If we do that, the download process would easily download a couple of hundred orders in less than a minute, and you would likely have an empty order queue at the FTP server when the network goes down. The importing process could then churn away at two orders per minute during the two hours of network maintenance.

Even if we were not that lucky, and hundreds of lunch orders were placed at the web shop around 11 (temporarily out of reach of your application), the "manual move" USB flash drive trick would then be simple to perform if needed: just a matter of dumping a batch of order files into the application's processing folder.

Either way, this design is far more robust than the naive implementation, regardless of how much error handling is put into the latter! Now we are unlikely to even notice shorter periods of downtime.

At an infrastructural level, you may also be able to improve robustness by installing network devices that can fall back to a secondary internet connection if the primary connection fails. Note that this is no replacement for separating downloading from importing, however, as there are still scenarios where the separation is beneficial, even if the network were 100% reliable.

11 Dec 2011

Implementing for Robustness

This entry is part 7 of 8 in the series Design Principles

Focus: Robustness
Description: Set robustness goals, and consider what should happen, and what should not happen in the face of likely (and less likely) errors.

In the last post, we were able to increase robustness by modifying the design of an order import application, separating the process of downloading orders from the process of importing them into the warehouse system (see previous posts for more details on the example we're working on). The redesign required some understanding of the existing design's effect on the robustness of the system, and some experience to decide what to do about it.

It may be argued that we traded complexity for robustness when we split one component into two, as we need some glue to integrate the two components. However, each of the two new components is now conceptually simpler and smaller than the combined one it replaces, and it should therefore be simpler to implement each part in a robust manner, as error handling for the download process no longer needs to be intermingled with error handling for the import process. Also, the glue needed in this case is more or less just a shared folder. Modularization and componentization of your software is therefore often a good thing when it comes to robustness.
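
As a minimal sketch of that glue (the folder path and method name are placeholders of mine, and this assumes both processes see the same NTFS volume): the download process writes each order to a temporary file, then moves it into the shared folder. Since a move within a single volume is effectively atomic, the import process never sees a half-written order file.

    using System.IO;

    static class OrderHandoff
    {
        // Hypothetical folder layout; both processes must agree on ImportFolder.
        const string ImportFolder = @"C:\FruutXpress\Import";

        // Called by the download process once a remote file has been fully
        // received into tempFile (on the same volume as ImportFolder).
        public static void PublishOrder(string tempFile)
        {
            string target = Path.Combine(ImportFolder, Path.GetFileName(tempFile));
            // Move, don't copy: the import process either sees the complete file
            // or no file at all. (File.Move also fails rather than overwrites if
            // the target already exists.)
            File.Move(tempFile, target);
        }
    }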

Now let's consider the implementation of each component. In this post we take a look at the downloading process, and leave the importing process for a later post. Please note that I'm going to show simplified implementations, or they would become far too long to reason about in a single post. But I hope they will get the message across anyway.

We’ll start by setting some robustness goals for our processes.

  1. Errors should not cause the processes to exit or stop processing.
  2. A single order file should not be able to block further processing.
  3. Orders should never be permanently lost.
  4. All correct orders should be imported exactly once (not zero, nor twice).

These goals will help us take appropriate action when handling errors.

It will be more or less impossible to guarantee we'll meet all goals, especially the first one, regardless of which errors occur. In the face of some errors, such as the complete loss of a necessary resource like disk space (or power), or an access-denied failure that needs manual administrator intervention to correct, we may have no option but to stop processing. However, we should be able to resume processing when the error condition is cleared, without needing a restart.

Also, if we suffer a hard drive crash after downloading orders, but before importing them, some orders will probably be lost. There's always a trade-off between the cost of failure and the cost of preventing failure.

The Order Downloading Process

How can we make the downloading robust? We need to consider the problems that could occur. Some likely errors are:

  • Network issues, which could temporarily prevent access to the FTP server, affect downloads in progress, or make the file server unreachable.
  • Permission issues, such as not being able to delete a downloaded file from the FTP server, or even to log in to it.
  • File system issues, such as disk full, file already exists, or access denied errors.
  • File format issues (empty files, unrelated files placed in the download folder).

(And yes, there are lots more potential issues, but listing more would take focus away from what I want to discuss.)

We will not try to interpret the files when downloading them, as that task is better left to the import process, which clearly must understand the contents of the order anyway.

  1:     void DownloadOrders()
  2:     {
  3:         do {
  4:             try {
  5:                 FTPServer webshopFTP = FTPServer.Open(webshopFTPaddress, user, password);
  6:                 foreach (RemoteFile remoteFile in webshopFTP.Files("Order*.xml"))
  7:                 {
  8:                     DownloadFile(remoteFile);
  9:                 }
 10:             } catch(Exception ex) {
 11:                 Log(ex);
 12:             }
 13:             Pause(retryDelay); 
 14:         } while(!doomsday);
 15:     }
 16:     
 17:     void DownloadFile(RemoteFile remoteFile)
 18:     {
 19:         try {
 20:             string tempFile = DownloadToTemporaryLocation(remoteFile);
 21:             MoveToOrderImportFolder(tempFile);
 22:             // if it cannot be deleted from the FTP, it will be downloaded again later
 23:             remoteFile.Delete();
 24:         } catch(Exception ex) {
 25:             Log(ex, remoteFile);
 26:         }
 27:     }

In order to meet goal 1, we simply add an outer catch-all exception handler (line 10). We also add a short pause to the outer loop (line 13), so as not to overload any services if there's a persistent error.

In order to meet goal 2, we use a file mask (Order*.xml, say) to include only relevant files. Even if we have a folder on the FTP server dedicated to the order files, an irrelevant file might accidentally be placed there, and this could potentially interfere with processing - say, a 12 GB file called 'webshopdb.bak', or thousands of image files, would likely cause issues and, if nothing else, consume time and download bandwidth unnecessarily. We also add an inner catch-all exception handler (line 24) to catch errors while processing individual files.

In order to meet goal 3, we only delete files from the FTP server once we are certain we've successfully stored a local copy (we'll reach line 23 only if lines 20 and 21 completed without throwing).

As the code is written now, if we cannot delete a file from the FTP server once downloaded, we will download many copies of this order and place them in the import folder (and keep doing so until the file can be removed from the FTP server). This means that the import process must be able to handle duplicate orders, or we'll violate goal 4! If we instead tried deleting before calling MoveToOrderImportFolder, undeletable files would delay the orders instead of creating duplicates, so it's a trade-off.
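
As a sketch of what that duplicate handling might look like on the import side (the Id attribute is a guess at the order file format, and a real system would persist seen IDs or query the warehouse API, since duplicates can also arrive after a restart):

    using System.Collections.Generic;
    using System.Xml;

    static class DuplicateGuard
    {
        // In-memory only for brevity; does not survive a process restart.
        static readonly HashSet<string> seenOrderIds = new HashSet<string>();

        // Returns true the first time an order ID is seen, false on repeats,
        // so the caller can archive a duplicate without importing it twice.
        public static bool TryClaim(XmlDocument orderDoc)
        {
            XmlElement root = orderDoc.DocumentElement;
            string orderId = root != null ? root.GetAttribute("Id") : "";
            return orderId.Length > 0 && seenOrderIds.Add(orderId);
        }
    }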

We’ll take a closer look at the order import process in the next post.

5 Mar 2012

Implementing for Robustness (cont’d)

This entry is part 8 of 8 in the series Design Principles

Focus: Robustness
Description: Set robustness goals, and consider what should happen, and what should not happen in the face of likely (and less likely) errors.

In the previous post we set up robustness goals, and implemented the first half of the order import functionality, performing downloads of orders via FTP. In this post we are going to complete the example by implementing the second half, which imports the orders into the warehouse system.

What we need to do

An outline of the functionality needed for the import process is as follows:

  • The download process leaves the order files it has downloaded in a folder where the import process takes over ownership of the files.
  • We are going to import the orders oldest-first.
  • If there’s an error processing an order, we’ll signal an error, move the file to the error folder and continue with the next order.
  • We are going to archive each order file in an archive folder after successful import.

(Image: Avoiding starvation and false prioritization.)

Timing Issues

Why do we care about the order in which we do the import; what does that have to do with robustness, you might ask. Aren't we simply going to import them all, so why care about ordering? The answer is that since importing takes a good 30 seconds per order, we may create a so-called "starvation" situation if newer order files get processed before older ones, such that some orders are delayed "indefinitely". When the orders are few and processing time is abundant, any strategy will do. But when there are enough orders that importing them all takes a significant amount of time, we will have problems without an ordering strategy.

For example, say we're importing order files in "unspecified" order (which in the case of .NET Directory.GetFiles on NTFS would be alphabetical order), and the naming scheme for the order files happens to be such that the name starts with a two-digit area code. In this case we inadvertently cause a false prioritization of lower area codes over higher area codes.

To understand what ill effects this might have, let's say there are 120 new order files for lower area codes, and one older file for a higher area code. At 30 seconds per import, this delays the order from the higher area code by around an hour, even though it was "first in line" before the other orders arrived! Since we've made promises to the customers regarding delivery times, the longer it takes for an order to be imported, the less time warehouse workers will have to pick the ordered goods, and we might not be able to fulfill the delivery in time.

Note that this also applies to the downloading process, as it wouldn't do much good to implement "oldest-first import" if downloading also falsely prioritizes some orders. However, as we're assuming that the downloading process is quite fast (orders of magnitude faster than importing), the ordering it uses is of little concern to us.

Processing errors

The kind of errors we can expect from order processing can roughly be placed in one of three categories:

  • Errors from the file system (reading, archiving)
  • Errors due to order contents (format issues, content validation issues)
  • Errors from the warehouse system API we’re using (validation issues)

Sample code

The presented code is kept as short as possible - possibly even too short - but I find too much code takes focus from what I actually want to discuss, which really isn't the exact implementation, but the need to achieve robustness by design, not by accident.

    public static void ImportOrders()
    {
        do {
            string orderFile = null;
            try {
                while ((orderFile = GetNextOrderFile()) != null)
                {
                    ImportAndArchiveOrder(orderFile);
                }
            } catch (Exception ex) {
                Log(ex, orderFile);
            }
            // only pause if there is an error, or there are no orders to import
            Pause(RetryDelay);
        } while (!doomsday);
    }

    private static void ImportAndArchiveOrder(string orderFile)
    {
        bool success = false;
        try {
            success = ImportOrder(orderFile);
        } catch (Exception ex) {
            Log(ex, orderFile);
        }
        ArchiveOrder(orderFile, success ? ArchiveFolder : ErrorFolder);
    }
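
To connect the sample code with the error categories listed earlier, here is a hedged sketch of how ImportOrder might treat them. ParseOrder, the Order type, warehouseApi and OrderValidationException are invented for this illustration; XmlException is the real .NET type thrown for malformed XML.

    private static bool ImportOrder(string orderFile)
    {
        try {
            XmlDocument orderDoc = new XmlDocument();
            orderDoc.Load(orderFile);             // malformed XML throws XmlException
            Order order = ParseOrder(orderDoc);   // content validation (invented helper)
            warehouseApi.Import(order);           // may throw OrderValidationException
            return true;
        } catch (XmlException ex) {
            // Bad order contents: report failure, so the caller moves the
            // file to the error folder and continues with the next order.
            Log(ex, orderFile);
            return false;
        } catch (OrderValidationException ex) {
            // Rejected by content checks or by the warehouse system API.
            Log(ex, orderFile);
            return false;
        }
        // File system errors (reading, archiving) propagate to the caller.
    }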

Care has to be taken in the implementation of GetNextOrderFile() not to commit to a huge number of files at once, as this can cause starvation: it translates into a huge amount of processing time before the next decision about which files to import. It also should not re-read the directory to find the oldest file on every call, as we would then (in the case of a single very old, unprocessable, unmovable file) try to import the same order over and over again.
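
As a sketch of one way to balance those two concerns (the folder path, file mask and batch size are placeholders of mine): scan the folder, sort oldest-first, but commit only to a small batch. A stuck file is then retried at most once per batch instead of monopolizing the loop, while the ordering decision is still revisited regularly.

    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    static class OrderQueue
    {
        const string ImportFolder = @"C:\FruutXpress\Import"; // hypothetical path
        const int BatchSize = 10; // small, so ordering is re-evaluated often

        static readonly Queue<string> batch = new Queue<string>();

        // Returns the next order file to import, oldest first, or null if none.
        public static string GetNextOrderFile()
        {
            if (batch.Count == 0)
            {
                // Re-scan and commit to a limited batch of the oldest files.
                foreach (FileInfo file in new DirectoryInfo(ImportFolder)
                    .GetFiles("Order*.xml")
                    .OrderBy(f => f.LastWriteTimeUtc)
                    .Take(BatchSize))
                {
                    batch.Enqueue(file.FullName);
                }
            }
            return batch.Count > 0 ? batch.Dequeue() : null;
        }
    }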

But why pause?

I include a pause here and there in the sample code. Why is that a good idea? Well, if you've ever seen event logs filled with tens of messages a second, all stating the same unexpected error ("Access Denied", say), you know that if some error bubbles up to the top level, it can be a good thing to allow processing to take a short break, if nothing else to avoid overloading the logs, and possibly other systems.

If your code normally processes two orders a minute, an unforeseen error situation can force it around the processing loop much, much quicker than you anticipated, and the code may then start to hammer some other service with requests, because you expected normal processing to provide a natural pause between such calls.

Just a few days ago, I saw a service almost bring a server to its knees because of an "Access Denied" error, which made the service unable to remove a file, so it tried processing the file over and over again. The processing needed for a single file took a significant amount of resources, and doing it over and over - without the slightest pause - really stressed the server. Had there been a top-level pause of even 10 seconds, the service would not have overloaded the server, and processing would only be very slightly delayed by the pause, as it really only pauses when there is no actual processing to do.

Improvements

An improvement over oldest-first would be to import in time-to-requested-delivery order, but this would be more complicated to get right, as orders wouldn't arrive in anything near the order we want to import them in, and we would also have to read the order files to get at this information. Oldest-first is good enough at this point, and certainly more robust than importing in "whatever" order. In any case, the implementation above does not care about the actual ordering method used, as that is abstracted away in the implementation of the GetNextOrderFile method.

There are likely lots of other improvements to be made to the code, as it's really just pseudocode.
