Thursday 15 May 2014

This Blog has moved

MV Concepts has just completed a restructuring and also has a new name: Complex Systemics. Why?





Followers of this blog can continue to read my posts at the new blog address:

http://bigdata-smartanalytics.blogspot.co.uk/

best regards,

Valda

Tuesday 29 April 2014

Big Data - the Big Mistakes (part four)



“Last week I posted about the connections between big data sources and military intelligence categories. This week I’m looking at big data analytics and the intelligence process”

5. Isn’t Big Data Analytics a Data-to-Decision Process?

Alarm bells always start ringing when I see Big Data architecture presentations which show the analytics process as a conveyor belt from data to decisions. They crop up far too frequently (just Google images for: “big data process” if you want to see examples). So let’s get the main message over right away:

Big Data Analytics is a Closed Loop, Data and Decision Cycle

We can learn a lot about the desirable elements of a Big Data analytics architecture by looking back to a military analogue – namely the intelligence process (sometimes called the intelligence cycle). Organisation of military intelligence began in earnest in the 1940s and grew into an established discipline during the 1950s and 1960s, when the basis of the modern intelligence process was established.

The modern intelligence process comprises six elements arranged in a loop. I’ll start with the two that are usually omitted from Big Data analytics architectures:

Data Requirements – defining the data needing to be collected so that the gaps in existing knowledge, relevant to the unfolding situation, can be efficiently filled. There are a few key words in this description:
  • Gaps in existing knowledge: knowing what you don’t know is important; Donald Rumsfeld famously said that there are “known unknowns” and “unknown unknowns”. The “known unknowns” should be on our data requirements list – the “unknown unknowns” are a field in themselves and the subject of much study in complex systems;
  • Relevant to the unfolding situation: it’s neither practical nor necessary to know everything about the whole world so some degree of selection based on expectations about what is likely to happen next is essential;
  • Efficiently filled: as I said in an earlier post, data collection and processing involves cost – don’t do either unless they increase your decision making ability.
Planning and Direction – determining how the data requirements can be met and developing a robust plan for acquisition. The plan needs to address some or all of these aspects:
  • Availability: can the required data be obtained from a trustworthy and reliable source; and is the availability consistent with providers’ terms and conditions, legal constraints and ethical considerations?
  • Cost: what is the lifetime cost of the data and how often will it be required? How difficult is the data to manipulate or process into a form which is actually useful?
  • Timeliness: can the data be acquired and processed quickly enough that it will make a sensible difference to the resulting analytics output?
  • Precision: does the data have the required degree of precision, or, can it be processed to extract enough precision to meet the requirements?
  • Change control: how quickly will the data source change: are the protocols, formats and access controls stable, what overhead will be involved in maintaining this data source?
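One lightweight way to make this planning checklist concrete is to score each candidate source against the five criteria. Here’s a minimal Python sketch; the class, field names and 0–5 scoring scale are my own illustrative assumptions, not an established method:

```python
from dataclasses import dataclass

@dataclass
class SourceAssessment:
    """Planning-and-direction checklist for one candidate data source.
    Field names are illustrative; each criterion is scored 0-5."""
    name: str
    availability: int   # trustworthy, legal and ethical to use?
    cost: int           # lifetime cost of acquisition and processing (5 = cheap)
    timeliness: int     # fast enough to matter to the resulting decision?
    precision: int      # meets (or can be processed to meet) requirements?
    stability: int      # protocols, formats and access controls stable?

    def score(self) -> float:
        """Unweighted mean across the five criteria."""
        return (self.availability + self.cost + self.timeliness
                + self.precision + self.stability) / 5

# Hypothetical sources, scored for illustration only
sources = [
    SourceAssessment("social-media-firehose", 4, 2, 5, 3, 2),
    SourceAssessment("government-census", 5, 5, 1, 4, 5),
]
ranked = sorted(sources, key=lambda s: s.score(), reverse=True)
for s in ranked:
    print(f"{s.name}: {s.score():.1f}")
```

In practice you would weight the criteria to match the decision at hand – timeliness might dominate for an early-warning feed, stability for a long-lived pipeline.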
Data Collection – this may be one of the most expensive (and sometimes dangerous) aspects of military intelligence, but it’s where Big Data excels – the easy availability of large amounts of data. The main consideration here is to let data collection be guided by the planning and direction phase.

Data Processing – both this stage and the next aim to convert raw data into actionable intelligence. The difference lies in the view of the data: at this stage the data is seen through a microscope, and the aim is to perform the following tasks:
  • Extraction: retrieval of the wanted data from the data stream, for example by fields or parsing; 
  • Normalization: conversion to the required format and scale, for example language translation, word stemming or metric scaling;
  • Provenance: tagging of the data with its source so that decisions can be audited if necessary;
  • Evaluation: assessment of the data reliability or precision.
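As a rough illustration of these four tasks applied to a single record, here is a minimal Python sketch; the pipe-separated record format, field names and reliability rule are all illustrative assumptions:

```python
import re
from datetime import datetime, timezone
from typing import Optional

def process_record(raw: str, source: str) -> Optional[dict]:
    """Microscope-level processing of one raw record: extraction,
    normalization, provenance tagging and a simple evaluation.
    The id|timestamp|text layout is purely illustrative."""
    # Extraction: pull the wanted fields out of the raw stream
    match = re.match(r"(?P<id>\w+)\|(?P<ts>[\d\-T:]+)\|(?P<text>.*)", raw)
    if match is None:
        return None  # unparseable record - route to a reject queue
    fields = match.groupdict()
    record = {
        "id": fields["id"],
        # Normalization: lower-case, trimmed text (stand-in for stemming,
        # translation, metric scaling and so on)
        "text": fields["text"].strip().lower(),
        "ts": fields["ts"],
        # Provenance: tag with source and ingest time so that downstream
        # decisions can be audited if necessary
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    # Evaluation: a crude reliability flag - empty text is low value
    record["reliable"] = bool(record["text"])
    return record

rec = process_record("u42|2014-04-29T10:00:00|Hello World", "twitter-sample")
```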
Information Analysis – as the level of abstraction of the data increases, relationships between data become increasingly important:
  • Medium Angle Processing: information extraction based largely on experience with, or statistical analysis of, previous examples of the same data stream; including statistical pattern recognition and machine intelligence;
  • Wide Angle Processing: use of multiple data streams, together with wider “world-view” and historical contexts, to provide a knowledge rich assessment of the unfolding situation.
Dissemination and Decision Support – the final, feedback element in the cycle, where decisions can be facilitated, actions taken, and new data requirements, planning and direction initiated.
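Read as a whole, the six elements chain into a closed loop rather than a conveyor belt. A toy Python sketch (every stage function is a placeholder) makes the feedback explicit:

```python
# Toy sketch of the intelligence cycle as a closed loop. All stage
# functions are placeholders; the point is that dissemination feeds
# back into new data requirements instead of terminating the process.

def requirements(knowledge):   return ["gap-1", "gap-2"]
def plan(reqs):                return {"sources": reqs}
def collect(p):                return ["raw data"]
def process(raw):              return ["clean data"]
def analyse(data, knowledge):  return knowledge + data
def disseminate(intel):        return intel  # decisions and actions happen here

knowledge = []
for cycle in range(3):         # in practice this loop never terminates
    reqs = requirements(knowledge)
    p = plan(reqs)
    raw = collect(p)
    data = process(raw)
    knowledge = analyse(data, knowledge)
    knowledge = disseminate(knowledge)
```

Each pass through the loop enlarges the knowledge base, and the updated knowledge reshapes the next round of data requirements – the data-to-decision conveyor belt has no equivalent of that last step.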

Putting these elements together into a Big Data analytics architecture has made a lot of sense for us and enabled us to streamline our whole process. What’s more it’s re-usable and process oriented, so it becomes easier each time.

We place emphasis on handling uncertainty, so we use a probabilistic approach allowing us to incorporate priors, integrate multiple sources of information and produce decision support – as well as data requirements, planning and direction – in a principled manner. We have found that this fits the architecture extremely well.
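For a flavour of what “incorporating priors and integrating multiple sources” can mean in practice, here is a minimal Bayesian update over a binary hypothesis; the prior and likelihood numbers are invented purely for illustration:

```python
def bayes_update(prior: float, likelihood_h: float, likelihood_not_h: float) -> float:
    """Posterior P(H | evidence) for a binary hypothesis H, given
    P(evidence | H) and P(evidence | not H)."""
    numerator = likelihood_h * prior
    return numerator / (numerator + likelihood_not_h * (1 - prior))

# Open-source background knowledge supplies the prior...
p = 0.10
# ...then each (assumed independent) data source updates it in turn
for lik_h, lik_not_h in [(0.8, 0.3), (0.7, 0.4)]:
    p = bayes_update(p, lik_h, lik_not_h)
print(f"posterior: {p:.3f}")
```

The same mechanism that produces the posterior also signals the next data requirement: if the posterior is still too uncertain to act on, that gap goes straight back onto the requirements list.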

Tuesday 22 April 2014

Big Data - the Big Mistakes (part three)



“I’ve been posting about some of my hands-on experience with big data and the lessons learned. Here’s the first of two posts drawing connections between big data and military intelligence”

4. Big Data is So New We’re Pretty Much Still in the Dark.

This is a comment I hear very often and it needs to be put into context. OK, I accept that Big Data in the narrow sense of “processing vast quantities of social media data” has been with us for around five years, and the wider definition, as applied to business analytics, only really started a decade ago.

However, defence organisations have been processing what we now call Big Data for a lot longer than that – at least 25 years and quite possibly a good deal longer. They’ve had plenty of time to refine their processing techniques and hone their skills on various types of Big Data. It’s a mistake for Big Data scientists to ignore that experience – I’m finding very useful information in technical publications from defence organisations in the 1980s and 1990s…

Next week I’ll be commenting on the intelligence cycle and levels of intelligence processing. Today I’m concentrating on data sources. Here are some of the military intelligence sources and what they might mean for modern Big Data:

  • OSCINT – the military acronym for Open SourCe INTelligence. Encyclopaedic documents, government reports (such as budgets and census data) and geographic databases are excellent sources of “scene-setting” data – often useful in defining a meaningful prior distribution before processing live information (more on this subject in a few weeks’ time);

  • TRAFINT – TRAFfic INTelligence. Knowing how much information is flowing through a particular channel is often the first indication of something interesting. Correlating events of interest with traffic flow is a simple early warning technique, particularly when combined with keyword filtering. It takes a lot less compute resource to count messages than to understand their content;
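A bare-bones version of this early-warning idea might simply count messages per time window and flag bursts against a baseline rate. In the Python sketch below, the window size, baseline and threshold factor are illustrative assumptions:

```python
from collections import Counter

def traffic_alert(timestamps, baseline_rate, window=60, factor=3.0):
    """Flag time windows whose message count exceeds `factor` times the
    expected baseline. Counting messages is far cheaper than
    understanding their content. `timestamps` are seconds since some
    epoch; window/factor values are illustrative."""
    counts = Counter(int(t) // window for t in timestamps)
    expected = baseline_rate * window
    return sorted(w for w, c in counts.items() if c > factor * expected)

# Baseline of 1 msg/sec; a 5x burst in the second minute trips the alert
ts = list(range(0, 60)) + [x / 5 for x in range(300, 600)]
print(traffic_alert(ts, baseline_rate=1.0))
```

Combining this with keyword filtering – only counting messages that match a watch list – keeps the compute cost low while sharpening the signal.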

  • COMINT – COMmunications INTelligence. The current mainstay of Big Data is social media analysis; this would be classified as COMINT in a military context. As well as understanding the content of the communication, the value of other aspects is often overlooked. These include: who is communicating, their audience, their location, the timing of the message, how frequently this person communicates and what the influence of their communications is;

  • IMINT – IMage INTelligence. Most military intelligence processing is image related. There are over 1 billion digital-camera-equipped smartphones in regular use, and there’s a vast, untapped ocean of image data publicly available on the internet. However, because processing it requires a wide data pipe and plenty of compute power, it hasn’t yet become a mainstream part of Big Data processing.

Hot Tip: Image analysis is a tiny part of Big Data at the moment, but its golden age is coming! I predict that within 5 years most of the Big Data processing load will be devoted to information extraction from users’ digital images.

Thursday 17 April 2014

Big Data - the Big Mistakes (part two)



“Last week I started a series of posts drawing from some of my hands-on experience with big data and the lessons learned. Here’s some more food for thought”

3. Big Data is too big to eye-ball.

Data changes all the time; the formats, the syntax, the content and the meaning. So, when performing regular data analytics there’s no substitute for the Mk. I Eyeball – any data scientist worth their salary will take a look at the raw data before designing an analytics strategy, and the processed data on a regular basis while the strategy is in operation; designing and implementing updates as necessary.

But you can’t do this with Big Data – there’s just too much of it – so you’ll just have to trust your data and press on, right? Wrong! Big data is far more variable and can change even faster than small data, so it’s essential that it gets at least as much attention. How do you eyeball that quantity of data? (I just checked, and the modest Big Data social media file I’m currently analysing for a client clocked in at 971 MB – I really don’t have the time to eyeball that amount of data on a regular basis!).

We use an internally developed package we fondly refer to as the “Mk. I Cybe-all”, but the principles are applicable to everyone:
  • Randomly select some of the data and output to a text file so the Mk. I Eyeball gets to see at least some of the data;
  • If a format or syntax is present, check each record using regular expressions and output any that don’t conform;
  • Build simple statistics of the data – record size, field size and character distributions are applicable to most data; output records that are statistically anomalous;
  • Constantly recalculate these statistics and look for movement;
  • If your analytics uses more complex statistics, regularly process data offline – again looking for data records that don’t fit;
  • Visualize the data – use simple clustering techniques to get similar data clumped together and colour to illustrate the basic statistics;
  • Analyse the data at more than one level of abstraction and allow the data scientist to drill down to get detail-on-demand.
Finally, once you have your own “Mk. I Cybe-all”, consider its value for your clients; we’ve been surprised several times to find that, actually, the functionality of the “Mk. I Cybe-all” is all the customer really needed…
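The “Mk. I Cybe-all” itself is internal, but a bare-bones sketch of the first three steps above might look like the following in Python; the record format, sample size, regular expression and thresholds are all illustrative:

```python
import random
import re
import statistics

def eyeball_sample(records, k=20, seed=0):
    """Step 1: random sample so the Mk. I Eyeball sees at least some raw data.
    A fixed seed keeps successive runs comparable."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

def nonconforming(records, pattern=r"^\w+,\d{4}-\d{2}-\d{2},.*$"):
    """Step 2: output records that fail the expected syntax.
    The default pattern (id,date,payload) is illustrative."""
    rx = re.compile(pattern)
    return [r for r in records if not rx.match(r)]

def length_outliers(records, z=3.0):
    """Step 3: simple statistics - flag records whose length lies more
    than `z` standard deviations from the mean."""
    lengths = [len(r) for r in records]
    mu, sd = statistics.mean(lengths), statistics.pstdev(lengths)
    if sd == 0:
        return []
    return [r for r in records if abs(len(r) - mu) / sd > z]

data = ["u1,2014-04-17,hello", "u2,2014-04-17,world", "GARBLED LINE"]
print(nonconforming(data))
```

Recomputing the step-3 statistics on a rolling basis, and alerting when they drift, covers the “look for movement” step with only a few more lines.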

Monday 7 April 2014

Big Data - the Big Mistakes (part one)



“Over the next few weeks I'll be posting some of my hands-on experience with big data and lessons learned. I'm kicking off today with two common and closely connected mistakes”

1. Big Data is valuable, so bigger data is more valuable.

Focus too closely on the "how much" and you risk losing sight of the "why". Big data is a raw material and, like any raw material, needs refining and processing into a valuable and relevant end product. Don't build your big data policy by starting with the data then working out what cloud resources you need or how it might fit into the Hadoop framework. Define the end product - the decisions you need to make - and work backwards to find out what data you need and how much of it is required. It's often surprising how little, high quality information is required to make important decisions (small data / big decisions). Remember: handling big data involves a cost - it's not free. This cost needs to be factored into the net value-add for the final decision.

Analogy: drinking large quantities of seawater is a bad idea - the more you drink the worse the situation gets. Better to define the end product (hydration) and process just enough seawater to produce a drink that will do the job.

2. Big Data can be used to avoid uncertainty.

No decision maker is likely to view uncertainty as a friend and data often comes with loads of the stuff - so what's the best approach to dealing with it? I most commonly see two approaches, both of which lie somewhere on the spectrum of suboptimal to disastrous. I call them the "ostrich" and the "perfectionist". The ostrich approach is to pretend there's no uncertainty and just make the decision as if complete information were available. The problem here is that, having made the decision, people tend to use increasingly elaborate explanations of why subsequent data isn't as expected but the decision is still OK. The perfectionist causes a more subtle problem: they reduce uncertainty by waiting for (and processing) more data. Not only does this cost more (see mistake #1), but it also makes the decision less timely and possibly less relevant. Factoring in timeliness is a key (and often ignored) part of a good decision - remember that the world doesn't stop while you're thinking.

Analogy: you need to walk across a busy concourse full of people. Better to make a short-term decision about where nearby people might be in the next few seconds, start walking, then quickly collect more data and decide again. Not a good idea to observe the crowd for ten minutes, calculate an optimum route, then close your eyes and go for it!