In part one of this guide we discussed data science and the field as a whole; in part two we discussed machine learning; now we'll touch on big data.

Big data deserves a section of its own, as the term so frequently goes hand in hand with data science and machine learning.

There is a popular misconception that computers have historically only been able to handle relatively small data volumes, and that 21st-century techniques are required as soon as millions or billions of records are involved.

This is a bit of a disservice to software engineering: as early as 1954 a UNIVAC I (a 2MHz valve computer with less than 12K of memory, depending on how you count it) was processing the entire US economic census for 163 million people.

Today a single SQL database, built on traditional computer science techniques developed in the 1960s and 1970s, can query a quarter of a billion records in real time with minimal latency (and a skilled engineer can multiply these numbers without resorting to novel approaches).
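As a small illustration of how far those traditional techniques go, the sketch below (using an in-memory SQLite database purely as a stand-in, with a made-up table and column names) shows the classic trick at work: a B-tree index turns what would be a scan over every row into a handful of lookups, which is exactly why an ordinary relational database copes comfortably with hundreds of millions of records.

```python
import sqlite3

# A minimal sketch, using SQLite purely for illustration; the "events" table
# and its columns are hypothetical. The point is the B-tree index: a
# decades-old technique that replaces a full scan with a few lookups.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events (user_id, amount) VALUES (?, ?)",
    ((i % 10_000, float(i)) for i in range(1_000_000)),
)
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

# The query planner now searches the index instead of scanning a million rows.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*), SUM(amount) FROM events WHERE user_id = ?",
    (42,),
).fetchall()
print(plan)  # reports a SEARCH using idx_events_user rather than a full SCAN

print(conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM events WHERE user_id = ?",
    (42,),
).fetchone())
```

The same principle, refined over decades in commercial database engines, is what makes querying a quarter of a billion records with minimal latency unremarkable rather than revolutionary.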

The Large Hadron Collider at CERN provides an interesting modern case from the applied sciences - its detectors can generate 600 million samples per second (25 GB of data per second), which adds up to many petabytes of data in total. At first glance this seems gargantuan - surely this is "big data"?

Looking more closely, though, we see that this data is streamed out live to a large distributed system made up of many thousands of computers (the LHC computing grid), each processing a small part of the data in isolation, without reference to the other data being processed in parallel around the globe, and using traditional (though certainly advanced, in physical science rather than computer science terms) techniques. Some may remember SETI@home from 1999 and draw parallels, however inexact; others may recall the work on distributed computing techniques dating to the late 1970s and early 1980s, which drew on yet earlier work on message-passing systems - work that also fed into operating system design and the invention of object-oriented programming.

Individual parts of the LHC computing grid perform several stages of analysis and filtering, but at no point is so vast an amount of data needed for a single purpose (nor is the window between data being generated and an answer being required so brief) that traditional computer science approaches would break down. The achievement and the engineering behind it are both deeply impressive, but it can all be tackled by smart people applying a classical divide and conquer approach and (at least on the software engineering side) traditional techniques.
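To make the divide and conquer point concrete, here is a minimal sketch of the pattern (not of the LHC grid itself - the partitioning, the filter threshold and the function names are all hypothetical): each worker filters and summarises its own partition of the data in complete isolation, and only the tiny per-partition summaries are ever brought back together.

```python
from multiprocessing import Pool
import random

# Hypothetical per-partition work: filter one chunk of samples in isolation and
# reduce it to a small summary (count and sum of the values kept). No worker
# ever needs to see another worker's data.
def summarise_partition(samples, threshold=0.9):
    kept = [s for s in samples if s > threshold]
    return len(kept), sum(kept)

def main():
    random.seed(0)
    # Stand-in for a huge stream: eight independent partitions of samples.
    partitions = [[random.random() for _ in range(100_000)] for _ in range(8)]

    # Classic divide and conquer: process the partitions in parallel, in isolation.
    with Pool(processes=4) as pool:
        summaries = pool.map(summarise_partition, partitions)

    # Only the tiny summaries are combined centrally, never the raw data.
    total_count = sum(count for count, _ in summaries)
    total_sum = sum(total for _, total in summaries)
    print(total_count, total_sum)

if __name__ == "__main__":
    main()
```

Scaled up by many orders of magnitude and hardened for real hardware, networks and failure modes, this is still the same shape of solution: partition, process locally, combine small results.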

These discussions are in no way intended to trivialise the excellent, even sometimes awe-inspiring, work done by scientists and engineers in all of these fields, nor to imply that they need only "read the manual" on computer engineering and, presto, a system is created. Such endeavours require great dedication and skill. It is more to say that the work done by software engineers and computer scientists over the last century has prepared us well to approach these kinds of tasks, provided latency at huge scale is not a critical factor.

If big data implies a non-traditional approach, then it could be argued that, to qualify as truly big data, two factors must converge: firstly, one has to be working with extremely large data volumes, of the order of many billions or even trillions of records; secondly, one has to be looking to query (or otherwise operate on) all of this data with minimal latency (not just restricting searches to, for instance, recent activity). Social networks such as Twitter, with its 21 million new tweets per hour and its entire catalogue of tweets instantly searchable, may qualify; search engines such as Google (applying natural language processing to searches across 130 trillion pages, reindexing perhaps 100 billion pages per day) may qualify; few other systems do.

However, the term is a useful signal to people who enjoy working with large datasets that such an opportunity exists in a particular field or at a particular company; and indeed some statistical techniques gain considerable accuracy from having a larger corpus to work from.

Hence the term big data is used quite freely, and does not always (or indeed often) imply that unusual or revolutionary approaches are needed. But equally, this does not preclude the possibility that an organisation or individual is doing something very smart.