Data deluge

The ever-increasing quantities of data available for processing are leading to a revolution in research and analysis. Each year a new high in the volume of data is reached; we now speak in exabytes (1000^6 bytes), zettabytes (1000^7 bytes) and even yottabytes (1000^8 bytes). Many home computers have drives with a capacity of one or more terrabytes (1000^4 bytes).

A far cry indeed from the 5.25 inch floppy disk (360 kilobytes) or 3.5 inch floppy disk (1.44 megabytes). You would need 2.8 trillion 5.25 inch floppy disks to store one exabyte. Actually a lot more, since they had a reasonably high failure rate. To put this in perspective, that pile of 5.25 inch floppy disks would stand 5.5 million km high.
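The arithmetic behind those figures is easy to check. The sketch below is my own back-of-the-envelope verification; the ~2 mm disk thickness is an assumption of mine, not a figure from the post, which is why my stack comes out a touch taller at roughly 5.6 million km.

```python
# Back-of-the-envelope check of the floppy-disk figures above.
EXABYTE = 1000 ** 6          # 10^18 bytes
FLOPPY_525 = 360 * 1000      # 360 kilobytes per 5.25 inch disk
DISK_THICKNESS_M = 0.002     # assumed ~2 mm per disk in a stack (my estimate)

disks = EXABYTE / FLOPPY_525
stack_km = disks * DISK_THICKNESS_M / 1000

print(f"disks needed: {disks:.2e}")                      # ~2.78e12, i.e. ~2.8 trillion
print(f"stack height: {stack_km / 1e6:.1f} million km")  # ~5.6 million km
```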

Of course this vast amount of data doesn’t necessarily make the quality of analysis any better. As I wrote earlier, easy access to various econometric and statistical analysis packages has led to a growth in extremely poor analysis, and we risk a flood of papers which make assertions not supported by data, or papers that are riddled with fundamental errors because the authors do not understand statistical and econometric analysis. Being able to use a calculator doesn’t mean one can multiply. Similarly, being able to use, say, Shazam is no guarantee that a person understands data analysis. Just as teachers should demand of their students proof that they can perform basic arithmetic with pen and paper, so too should data analysts demonstrate their ability to perform econometrics without a computer.

What is required is some computer package that can automatically analyse a paper for errors.

But the data deluge also holds much promise. By connecting data from different sources, with different attributes, novel and interesting insights can be uncovered in numerous fields, including traffic flow, crowd behaviour and the like.

In this, the Australian Bureau of Statistics is far from helpful. Behind the shield of privacy, Brian Pink’s ABS is an organisation with a wall of data it refuses to release. The ABS is like a black hole – sucking up information which is never seen again.

John Kay rightly observes that weather forecasting has improved dramatically in recent years thanks to vast amounts of data and processing power. As Kay notes:

In 1987 Michael Fish went on television to reassure viewers rumours of an imminent hurricane were unfounded. A few hours later the most severe winds in decades lifted roofs and felled trees all over Britain.

But such a blunder is much [less] likely now. Short-term weather forecasting is one of the triumphs, perhaps the greatest triumph, of big data – the opportunity supercomputers provide to process data sets of unbelievable size and complexity. I understand that the latest machines can handle an exabyte of data, which is about 20m times the capacity of my Apple Mac. The British Meteorological Office claims that its three-day forecasts today are as accurate as its one-day forecasts were in the heyday of Mr Fish (which is perhaps not the most reassuring way of describing their improved performance).

But Kay has a strong caution. While short-term weather forecasting is now more reliable, that reliability drops off rapidly as the forecast horizon lengthens. We can be pretty sure of tomorrow’s GDP, but not next year’s GDP. Or as Kay says:

Big data can help us understand the past and the present but it can help us understand the future only to the extent that the future is, in some relevant way, contained in the present.

And that, ladies and gentlemen, is the chimera that climate models seek. That future is not in any way contained in the present. It is a chaotic system, impossible to forecast. Let’s divert that vast computer network, presently devoted to a hopeless pursuit, to more near-term analysis which has relevance to the present and can provide helpful findings.
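The practical bite of sensitive dependence on initial conditions can be seen in a toy example. The logistic map below is a standard textbook chaotic system, chosen by me for illustration; it is not a climate model and is not from Kay or the post. Two starting states differing by one part in a million become completely decorrelated after a few dozen iterations, even though the governing equation is known exactly.

```python
def logistic(x, steps):
    """Iterate the chaotic logistic map x -> 4x(1-x) for a number of steps."""
    for _ in range(steps):
        x = 4 * x * (1 - x)
    return x

# Two initial conditions differing by one part in a million.
a, b = 0.400000, 0.400001

# The gap between the two trajectories grows roughly exponentially,
# from ~1e-6 toward order 1, after which the trajectories are unrelated.
for n in (5, 20, 35):
    print(n, abs(logistic(a, n) - logistic(b, n)))
```

Measurement error on that scale is unavoidable in any real system, which is why a perfect model of a chaotic system still cannot deliver long-range point forecasts.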

About Samuel J

Samuel J has an economics background and is a part-time consultant.
This entry was posted in Technology & Telco.

35 Responses to Data deluge

  1. Ant

    What is required is some computer package that can automatically analyse a paper for errors.

    And plagiarism.

  2. JohnA

    Ant #1186465, posted on February 12, 2014 at 7:43 am

    What is required is some computer package that can automatically analyse a paper for errors.

    And plagiarism.

    Worth a chuckle, but not worth a candle as serious analysis.

    Sorry to wax philosophical, but you are asking for the ability to automate a measurement against an external standard – Truth.

    And a second external standard – previous work in the same field.

    This will always require the application of wisdom, not merely brain power.

  3. sabrina

    Specialised software packages to detect plagiarism or salami slicing are available; they are used by established journals and also by some universities.

    And… It is a chaotic system, impossible to reliably forecast with the current level of knowledge.

  4. Samuel J

    No, Sabrina. The definition of chaos is that it is impossible to forecast with any level of knowledge.

  5. It is a chaotic system, impossible to forecast

    Don’t assume. Some things are just more difficult than others.

  6. Rabz

    The British Meteorological Office claims that its three-day forecasts today are as accurate as its one-day forecasts were in the heyday of Mr Fish

    They can claim whatever they bloody well like, their well deserved reputation for entirely consistent, monumental wrongology remains intact.

  7. Bruce J

    The availability of data does not, by any stretch of the imagination, ensure the application of common sense.

  8. Leigh Lowe

    What is required is some computer package that can automatically analyse a paper for errors.

    And plagiarism.

  9. Samuel J

    Driftforge: just as I accept that the speed of light is a constant close to 300,000 km/s, so too do I accept that chaos cannot be predicted.

    Now I might be wrong about both, but they are well accepted and until there is proof to the contrary I will continue to accept these theories.

    If random behaviour could be predicted it wouldn’t be random.

  10. Tom

    The availability of data does not, by any stretch of the imagination, ensure the application of common sense.

    Indeed. Whatever else it does, the data revolution merely exposes the mendacity and dishonesty of the people using it in this post-modern era of self-hatred about the spectacular success of Western civilisation.

  11. ChrisPer

    The data torrent is bloody thin in some areas. How do we measure the effects of legislation, eg sentencing changes, laws to attack organised crime, or the bloody stupid 2002 handgun buyback? The negatives are ignored, the positives misrepresented, and the externalised costs covered up.

  12. Tel

    The definition of chaos is that it is impossible to forecast with any level of knowledge.

    The molecules in a cup of water bounce around all over the place, in total chaos. Do you thus conclude that we cannot predict where the water will be in an hour?

  13. samuel j

    No, that’s confusing a system that is partly in chaos with elements that are not chaotic. It is confusing the micro with the macro. The Martian moon Phobos tumbles chaotically, and it is impossible to predict how it will tumble next. Yet its orbit can be predicted for thousands of years – assuming no other object knocks into it. It is an object that exhibits both chaotic and predictable behaviour.

    My claim – such as it is – is that one cannot make a prediction in relation to a chaotic system.
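    The micro/macro distinction here can be sketched numerically. The toy simulation below is my own illustration, not from the thread: each simulated "molecule" follows an unpredictable random walk (a stand-in for molecular motion, not true deterministic chaos), yet the ensemble average is tightly predictable.

    ```python
    import random

    random.seed(1)  # fixed seed so the sketch is repeatable

    N, STEPS = 10_000, 100
    positions = []
    for _ in range(N):
        # One "molecule": a random walk of unit steps left or right.
        x = 0.0
        for _ in range(STEPS):
            x += random.choice((-1.0, 1.0))
        positions.append(x)

    mean = sum(positions) / N
    print(f"one molecule ended at {positions[0]:+.0f}")  # anyone's guess
    print(f"ensemble mean: {mean:+.2f}")                 # close to 0
    ```

    No single trajectory can be forecast, but the bulk behaviour can – which is why unpredictable micro-motion does not stop us predicting where the water in the cup will be.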

  14. Frank

    SamuelJ,

    No, Sabrina. The definition of chaos is that it is impossible to forecast with any level of knowledge.

    The definition of chaos is (partly) that it is deterministic.

  15. Julian mclaren

    My favourite saying (given I am a Financial Planner that ignores the constant spin of an industry completely conflicted) is “the future is a series of unpredictable events”.

  16. feelthebern

    All data the ABS collects should be freely available to everyone.
    Their database should have an element of open architecture, so any uni, business or citizen can tap in & trawl through whatever they like.
    Zero FOI protections.
    If the ABC collects it, it should be available within a month.

  17. Rabz

    If the ABC collects it, it should be available within a month.

    If the ALPBC did collect it, the data would be tortured out of all recognition and rendered utterly useless. That is their standard MO with regards to news, politics and current affairs, after all.

  18. feelthebern

    If the ABC collects it, it should be available within a month.

    lol
    I’ll put that one down to auto spell.
    If the ABS collects it, it should be available within a month.

  19. incoherent rambler

    Samuel, how about data veracity? What do we do about the CSIRO and BoM fiddling raw data to suit the hypothesis of the day?

  20. Andrew

    The CO2 part is non-chaotic. Fortunately so, as that means an upper bound can easily be placed on CO2 sensitivity from empirical history. It is known to be in the range 0-2K per doubling, with the physicists putting a most likely value of 1K on it.

    Next time you see a warmist chart in the Greenian or referenced from SkS, have a look at the slope of the trend. It will be trivially small if from last century, or negative if from this one.

  21. WhaleHunt Fun

    Let’s divert that ~~vast computer network~~ food and water, presently devoted to lying carpetbagger climate voodoo warmist collaborators ~~to a hopeless pursuit~~, to more near-term analysis which has relevance to the present and can provide helpful findings.

    Fixed

  22. Bruce of Newcastle

    An increasing problem today is the vast flood of data makes it easier to hide things you don’t want seen. This is coupled with increasing politicisation of science – and not just climatology. Think about the recent pronouncements on tertiary smoke (!), sugar and alcohol.

    Unfortunately this destroys the credibility of science with the people. General maths and statistics learning is disappearing from schools, and ordinary people just aren’t equipped to penetrate the jungle of obfuscation from these people.

    I predict a great rationalisation. Science budgets are going to get massacred in the not so long term since taxpayers are not going to regard science and scientists with any deference at all. And why pay for something you can’t believe or understand? It is not as if much practical stuff is coming out of the government science sector. Most of the successes we see are from individuals and entrepreneurs, who are exactly the sort of disruptive people who the likes of CSIRO and the unis don’t want.

  23. ar

    To put this in perspective, that pile of 5.25 inch floppy disks would stand 5.5 million km high.

    I don’t know if that really puts it in perspective… just saying…

  24. braddles

    It’s certainly interesting that an increase in computing power of several million times (perhaps much more) has only increased weather forecasting skill by a factor of 3. Even then, how much of that factor of 3 would be due to more precise and widespread observations (more satellites) and improved theoretical understanding, rather than computer number crunching?

  25. cohenite

    Climate has patterns as well as stochastic elements. There are various ways of dealing with the stochastic factors; Koutsoyiannis is an expert on this and is well worth reading.

    An adequate methodology of dealing both with the deterministic and stochastic elements of weather and climate is essential for such things as using the temperature record to extract meaningful trends. Currently we have BOM producing its ACORN temperature network with cogent question marks remaining over its claim that the defects with the previous HQ network had been resolved.

    There is no doubt the current official analysis of climate, from which predictions and forecasts are made, has been corrupted by the AGW ideology. If the modelling has inherent assumptions about the climatic effect of CO2 which are demonstrably wrong, and also incorrect assumptions about such things as clouds, then calamitous forecasts, and consequent policy disasters such as Wivenhoe, will occur.

  26. Botswana O'Hooligan

    The best weather forecast the Bureau gave me was “it will be wet and sandy on the beaches”. That was about as good as “fine throughout except for isolated showers and thunderstorms on Cape York Peninsula”, broadcast daily by the ABC in N Qld for years on end. Mind you, the Townsville forecaster at Garbutt aerodrome used to watch ants carefully, and the reason he gave for not tossing them in the air to determine wind direction was cruelty to the little ants, and that elephants are too heavy! That just about sums up the effectiveness of the BoM, but they do produce a great calendar and employ those eccentrics and unemployable people who would otherwise be loose on the streets.

  27. Menai Pete

    In this, the Australian Bureau of Statistics is far from helpful. Behind the shield of privacy, Brian Pink’s ABS is an organisation that has a wall of data it refuses to release. The ABS is like a black hole – sucking up information which is never seen.

    The ABS is staffed by bureaucratic thugs with fascist inclinations who suck up information from unwitting individuals and businesses by using threats and intimidation. The ABS then has the gall to claim that information was willingly provided for its surveys when that information was only provided to avoid a financially crippling fine.

  28. Menai Pete

    Slavery and forced labour are not dead when dealing with the Australian Bureau of Statistics.
    The information that the ABS collects is provided by individuals and businesses who are NOT paid for the time, effort and cost that is required to gather the information and provide it to the (paid) bureaucratic cretins who collect and process it.

  29. johanna

    It is about time that governments stood up to weather bureaux, climate change researchers and their ilk, who are constantly demanding bigger and shinier toys to play with at our expense. The assertion that failures of prediction are due to inadequate computer power is an absolute crock.

    Forecasting the behaviour of dynamic complex systems (economic forecasting is similar) has a couple of simple properties that these people seem to have forgotten.

    The first is that accuracy is in inverse proportion to the length of time being forecast. Hence, a three-day forecast, whether of the economy or the weather, is inherently more reliable than a three-month or three-year one. As for decades or centuries, they are just making stuff up.

    Second, in attempting to forecast such systems, using two or three well-understood variables generally produces better results than attempting to model the interaction of a large number of less understood ones. Every extra variable introduces the potential for more, and escalating, errors.

    It would require a breakthrough of at least Einsteinian proportions to overcome these limitations. No amount of extra computer power makes any difference.

    That is not to say that highly complex modelling does not play a useful role in areas like physics and engineering. But climate and economics – fuggedabout it. There are plenty of boosters out there who disagree, but the evidence supporting their pleas for more handouts is thin, to put it mildly.

  30. What is required is some computer package that can automatically analyse a paper for errors.

    We have one already. It’s called a ‘brain’.

    And when it belongs to an educated, critical and informed person, it’s amazing what they can pick up.

  31. Toiling Mass

    Forgive the orthographic pedantry, but I believe it is ‘terabyte’, rather than ‘terrabyte’ – which would seem to be related to ‘Earth’.

    Of course, there are ‘terror-bytes’ – my office laptop has plenty of these.

  32. boy on a bike

    I’d like to see a lot more data put through the sort of visualisations that Hans Rosling does. If you haven’t seen him on YouTube, you really should have a look.

    I’d start with NAPLAN data – I think it would be fascinating to see it visualised the way Hans does it.

  33. Louis Hissink

    Samuel, how about data veracity? What do we do about the CSIRO and BoM fiddling raw data to suit the hypothesis of the day?

    If you don’t believe in objective truth then data fiddling becomes permissible – whether climate or as Keith Windschuttle has demonstrated, Australian aboriginal history. It’s part and parcel of being of the political left mindset – where truth is determined by a hand count.

    I have, however, no civilised manner of solving this problem.

  34. Nato

    If you can balance that many disks like that, I’d love to see the house of cards you could build. Maybe post a photo? We always used to lay ours flat on top of each other. Your ceiling doesn’t need to be that high.

Comments are closed.