Spreadsheets have been much in the news over the last couple of days in the UK, and not for good reasons. It looks as though Excel’s limit on the number of rows in a spreadsheet has led to Public Health England misplacing more than 16,000 Covid test results. The implications of this mistake are horrendous to consider: as many as 50,000 potentially infectious people who should have been self-isolating may not have been contacted.
This is not the first time spreadsheets have been implicated in significant errors that turned out to be costly, both financially and in human terms. In 2013 JP Morgan inadvertently masked a loss of almost $6 billion, due in part to a spreadsheet formula that divided by the sum of two interest rates rather than their average.
A similar formula error led the authors of an influential 2010 economics paper to conclude, incorrectly, that high levels of public debt slow economic growth. That conclusion was used by both the IMF and the UK government to justify extensive austerity programmes over the following decade, resulting in poverty, lost jobs and even riots. Spreadsheets have also been implicated in the collapse of the Jamaican banking system in the 1990s, as well as the worldwide financial crash of 2008.
Despite their widespread use, relative familiarity, general convenience and flexibility, spreadsheets should not be used as the cornerstone of any analytical process, especially if the data is mission-critical, sensitive in any way, or part of an activity that has to manage large volumes of data (by which I mean anything more than a few hundred rows). Here’s why.
Spreadsheets aren’t suited to large datasets
Spreadsheets aren’t designed to handle large data sets. Microsoft’s own technical specification puts the maximum worksheet size at 1,048,576 rows by 16,384 columns, and we have witnessed many cases where reliability and performance begin to falter well before data reaches that size. This seems to be at the heart of the problem with the Covid testing data: data from multiple CSV files was being fed into a central spreadsheet, those managing the process failed to notice when the number of cases exceeded the available rows, and any subsequent records simply weren’t added to the sheet.
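If data has to end up in a worksheet at all, the overflow at least needs to be caught rather than silently swallowed. Here is a minimal sketch, in Python, of the kind of guard that would flag the problem before records disappear; the folder path and helper names are illustrative, not part of any real pipeline.

```python
import csv
from pathlib import Path

EXCEL_MAX_ROWS = 1_048_576  # Excel worksheet row limit per Microsoft's specification

def count_data_rows(csv_path: Path) -> int:
    """Count data rows in a CSV file (excluding the header row)."""
    with csv_path.open(newline="") as f:
        return sum(1 for _ in csv.reader(f)) - 1

def check_fits_in_excel(csv_files: list[Path]) -> None:
    """Fail loudly if the combined data would overflow a single Excel worksheet."""
    total = sum(count_data_rows(p) for p in csv_files)
    if total + 1 > EXCEL_MAX_ROWS:  # +1 for the header row
        raise ValueError(
            f"{total} records will not fit in one Excel sheet "
            f"(limit {EXCEL_MAX_ROWS - 1} data rows); records would be silently dropped"
        )

# Hypothetical usage: validate the daily lab feeds before they are merged
# check_fits_in_excel(sorted(Path("lab_feeds").glob("*.csv")))
```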
Spreadsheets are a poor choice for blending data from multiple sources
Excel is a poor choice when you need to automate analysis or import data from a variety of sources, both routines that the Covid-19 track and trace process presumably has to perform. Cleaning up data sets of any size, especially larger sets running to thousands of cases, is time-consuming and tedious in a tool like Excel, and it is easy for errors to be introduced into the data during the clean-up itself. With the manual and often clumsy import abilities spreadsheets offer, processing times can also be very slow compared to dedicated analytical platforms and tools.
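For comparison, here is a rough sketch of how the same multi-source import might be scripted with pandas, so that every feed is loaded and cleaned by the same rules each time. The folder layout and the specimen_date column are assumptions made purely for illustration.

```python
import pandas as pd
from pathlib import Path

def load_lab_feed(csv_path: Path) -> pd.DataFrame:
    """Load one lab's CSV export and apply the same clean-up rules every time."""
    df = pd.read_csv(csv_path, dtype=str)
    df.columns = df.columns.str.strip().str.lower()  # normalise header names
    # "specimen_date" is a hypothetical column name used for illustration
    df["specimen_date"] = pd.to_datetime(df["specimen_date"], errors="coerce")
    return df.dropna(subset=["specimen_date"])  # drop rows with unparseable dates

def combine_feeds(folder: Path) -> pd.DataFrame:
    """Concatenate every feed in a folder: no row limit, no manual copy-and-paste."""
    frames = [load_lab_feed(p) for p in sorted(folder.glob("*.csv"))]
    combined = pd.concat(frames, ignore_index=True).drop_duplicates()
    print(f"Loaded {len(combined):,} records from {len(frames)} files")
    return combined

# Hypothetical usage:
# results = combine_feeds(Path("daily_lab_feeds"))
```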
Spreadsheets don’t provide a proper audit trail
Perhaps most pertinently, spreadsheets are not appropriate for storing historical data. When a value in a cell is updated, any record of what that cell previously contained is immediately and permanently lost. That loss creates problems for data analysis and auditing, making it difficult to ensure the integrity of the data (as so disastrously demonstrated by the Track and Trace data handling procedures).
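By contrast, even a very small database can keep every version of every record. The sketch below, using Python’s built-in sqlite3 module, shows an append-only table in which updates are new rows rather than overwrites, so the full history stays available for audit; the table and column names are purely illustrative.

```python
import sqlite3
from datetime import datetime, timezone

# A minimal append-only store: every change is a new row, nothing is overwritten,
# so the full history of each record remains available for audit.
conn = sqlite3.connect("results_audit.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS test_results (
        case_id     TEXT NOT NULL,
        result      TEXT NOT NULL,
        recorded_at TEXT NOT NULL   -- UTC timestamp of this version
    )
""")

def record_result(case_id: str, result: str) -> None:
    """Insert a new version of a result instead of updating the old one in place."""
    conn.execute(
        "INSERT INTO test_results (case_id, result, recorded_at) VALUES (?, ?, ?)",
        (case_id, result, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

def history(case_id: str) -> list[tuple]:
    """Return every recorded version of a case, oldest first: the audit trail."""
    return conn.execute(
        "SELECT result, recorded_at FROM test_results "
        "WHERE case_id = ? ORDER BY recorded_at",
        (case_id,),
    ).fetchall()
```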
It’s also very easy, as the examples above show, to introduce an error into a spreadsheet without any way of picking it up. The spreadsheet simply spits out the results of whatever formulae humans enter, and too often those results are treated as 100% reliable and carved in stone, without anyone taking the time to check and double-check that the formulae producing them are correct.
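One practical difference with scripted analysis is that a formula can be put under a test. The snippet below is loosely modelled on the sum-versus-average mistake mentioned earlier; the function name and figures are made up for illustration, but a test like this would catch exactly that kind of slip before anyone relied on the output.

```python
# A scripted calculation can be covered by a test; a spreadsheet formula usually isn't.

def rate_volatility(old_rate: float, new_rate: float) -> float:
    """Relative change in a rate, scaled by the *average* of the two observations."""
    return abs(new_rate - old_rate) / ((old_rate + new_rate) / 2)

def test_rate_volatility():
    # A move from 2% to 4% relative to their average (3%) should be ~0.667,
    # not ~0.333 as a buggy "divide by the sum" version would report.
    assert abs(rate_volatility(0.02, 0.04) - 2 / 3) < 1e-9

if __name__ == "__main__":
    test_rate_volatility()
    print("formula check passed")
```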
This fundamental weakness in ensuring data integrity also casts doubt on the robustness of any downstream analysis and reporting used to identify trends and evaluate performance. With no audit trail, the only way to mitigate the risk is to keep multiple copies of the same spreadsheet from different periods, saved and stored separately, which creates an entirely new set of data-integrity problems, not least the scope for human error in managing all those copies.