The ever-increasing quantities of data available for processing are leading to a revolution in research and analysis. Each year sets a new high in the volume of data; we now speak in exabytes (1000^6 bytes), and even zettabytes (1000^7 bytes) and yottabytes (1000^8 bytes). Many home computers have drives with a capacity of one or more terabytes (1000^4 bytes).
A far cry indeed from the 5.25 inch floppy disk (360 kilobytes) or the 3.5 inch floppy disk (1.44 megabytes). You would need about 2.8 trillion 5.25 inch floppy disks to store one exabyte. Actually rather more, since they had a reasonably high failure rate. To put this in perspective, that pile of 5.25 inch floppy disks would stand some 5.5 million km high.
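For those who want to check the arithmetic, here is a quick back-of-the-envelope sketch in Python. It uses the decimal (power-of-1000) byte units given above; the 2 mm per-disk thickness is my own assumption for a 5.25 inch disk in its jacket.

EXABYTE = 1000**6            # 10^18 bytes
FLOPPY_5_25 = 360 * 1000     # 360 kilobytes per 5.25 inch disk
DISK_THICKNESS_M = 0.002     # assumed ~2 mm per disk, including the jacket

disks = EXABYTE / FLOPPY_5_25
stack_km = disks * DISK_THICKNESS_M / 1000

print(f"Disks needed: {disks:.2e}")        # roughly 2.8e12, i.e. about 2.8 trillion
print(f"Stack height: {stack_km:.2e} km")  # roughly 5.6e6 km, about 5.5 million km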
Of course this vast amount of data doesn’t necessarily make the quality of analysis any better. As I wrote earlier, easy access to various econometric and statistical analysis packages has led to a growth in extremely poor analysis, and we risk a flood of papers that make assertions not supported by the data, or papers riddled with fundamental errors because the authors do not understand statistical and econometric analysis. Being able to use a calculator doesn’t mean one can multiply. Similarly, being able to use, say, Shazam is no guarantee that a person understands data analysis. Just as teachers should demand of their students proof that they can perform basic arithmetic with pen and paper, so too should data analysts demonstrate their ability to perform econometrics without a computer.
What is required is some computer package that can automatically analyse a paper for errors.
But the data deluge also holds much promise. By connecting data from different sources, with different attributes, novel and interesting findings can emerge in numerous fields, including traffic flow, crowd behaviour, and the like.
In this, the Australian Bureau of Statistics is far from helpful. Behind the shield of privacy, Brian Pink’s ABS sits on a wall of data it refuses to release. The ABS is like a black hole, sucking up information which is never seen again.
John Kay rightly observes that weather forecasting has improved dramatically over recent years thanks to the vast amounts of data and processing power now available. As Kay notes:
In 1987 Michael Fish went on television to reassure viewers rumours of an imminent hurricane were unfounded. A few hours later the most severe winds in decades lifted roofs and felled trees all over Britain.
But such a blunder is much [less] likely now. Short-term weather forecasting is one of the triumphs, perhaps the greatest triumph, of big data – the opportunity supercomputers provide to process data sets of unbelievable size and complexity. I understand that the latest machines can handle an exabyte of data, which is about 20m times the capacity of my Apple Mac. The British Meteorological Office claims that its three-day forecasts today are as accurate as its one-day forecasts were in the heyday of Mr Fish (which is perhaps not the most reassuring way of describing their improved performance).
But Kay adds a strong caution. While short-term weather forecasting is more reliable, that reliability drops off rapidly as the horizon lengthens. We can be pretty sure of tomorrow’s GDP, but not next year’s. Or as Kay says:
Big data can help us understand the past and the present but it can help us understand the future only to the extent that the future is, in some relevant way, contained in the present.
And that, ladies and gentlemen, is the chimera that climate models seek. That future is not in any way contained in the present. The climate is a chaotic system, impossible to forecast. Let’s divert that vast computing power, presently devoted to a hopeless pursuit, to near-term analysis that is relevant to the present and can deliver useful findings.