Big Data: Smart Analytics

Thursday 15 May 2014

MV Concepts has just completed a restructuring and also has a new name: Complex Systemics. Why? Followers of this blog can continue to read my posts at the new blog address:
http://bigdata-smartanalytics.blogspot.co.uk/
Best regards,
Valda

Tuesday 29 April 2014
Big Data - the Big Mistakes (part four)
“Last week I posted
about the connections between big data sources and military intelligence
categories. This week I’m looking at big data analytics and the intelligence
process”
5. Isn’t Big Data
Analytics a Data-to-Decision Process?
Alarm bells always start ringing when I see Big Data architecture presentations which show the analytics process as a conveyor belt from data to decisions. They crop up far too frequently (just Google images for "big data process" if you want to see examples). So let's get the main message over right away:
Big Data Analytics is
a Closed Loop, Data and Decision Cycle
We can learn a lot about the desirable elements of a Big Data analytics architecture by looking back to a military analogue – namely the intelligence process (sometimes called the intelligence cycle). Organisation of military intelligence began in earnest in the 1940s and grew into an established discipline during the 1950s and 60s, when the basis of the modern intelligence process was established.
The modern intelligence process comprises six elements arranged in a loop. I'll start with the two that are usually omitted from Big Data analytics architectures:
Data Requirements – defining the data that needs to be collected so that gaps in existing knowledge, relevant to the unfolding situation, can be efficiently filled. There are a few key words in this description:
- Gaps in existing knowledge: knowing what you don’t know is important; Donald Rumsfeld famously said that there are “known unknowns” and “unknown unknowns”. The “known unknowns” should be on our data requirements list – the “unknown unknowns” are a field in themselves and the subject of much study in complex systems;
- Relevant to the unfolding situation: it’s neither practical nor necessary to know everything about the whole world so some degree of selection based on expectations about what is likely to happen next is essential;
- Efficiently filled: as I said in an earlier post, data collection and processing involve cost – don't do either unless it increases your decision-making ability.
Planning and
Direction – determining how the data requirements can be met and developing a
robust plan for acquisition. The plan needs to address some or all of these
aspects:
- Availability: can the required data be obtained from a trustworthy and reliable source; and is the availability consistent with providers’ terms and conditions, legal constraints and ethical considerations?
- Cost: what is the lifetime cost of the data and how often will it be required? How difficult is the data to manipulate or process into a form which is actually useful?
- Timeliness: can the data be acquired and processed quickly enough that it will make a sensible difference to the resulting analytics output?
- Precision: does the data have the required degree of precision, or, can it be processed to extract enough precision to meet the requirements?
- Change control: how quickly will the data source change? Are the protocols, formats and access controls stable? What overhead will be involved in maintaining this data source?
Data Collection – this may be one of the most expensive (and sometimes dangerous) aspects of military intelligence, but it's where Big Data excels: large amounts of data are readily available. The main consideration here is to let the data collection be guided by the planning and direction phase.
Data Processing – both this stage and the next aim to convert raw data into actionable intelligence. The difference lies in the view of the data: at this stage, data is seen through a microscope, and the aim is to perform the following tasks:
- Extraction: retrieval of the wanted data from the data stream, for example by fields or parsing;
- Normalization: conversion to the required format and scale, for example language translation, word stemming or metric scaling;
- Provenance: tagging of the data with its source so that decisions can be audited if necessary;
- Evaluation: assessment of the data reliability or precision.
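To make those four tasks concrete, here's a minimal Python sketch of a single processing pass. The `msg=` field format, the `Record` type and the length-based reliability score are all illustrative, not a real pipeline:

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Record:
    text: str          # normalized payload
    source: str        # provenance tag
    collected_at: str  # collection timestamp, for auditing
    reliability: float # crude evaluation score, 0..1

def process(raw: str, source: str) -> Optional[Record]:
    """One pass of the processing stage: extract, normalize, tag, evaluate."""
    # Extraction: pull the message body out of a "field=value" style stream
    match = re.search(r"msg=(.*)", raw)
    if match is None:
        return None  # nothing to extract
    # Normalization: lower-case and collapse whitespace
    text = " ".join(match.group(1).lower().split())
    # Evaluation: a placeholder reliability score - longer messages score higher
    reliability = min(1.0, len(text) / 100)
    # Provenance: tag with source and collection time so decisions can be audited
    return Record(text, source, datetime.now(timezone.utc).isoformat(), reliability)

rec = process("id=42 msg=Hello   BIG Data", "twitter-firehose")
print(rec.text)  # hello big data
```

The point of the sketch is the shape, not the details: every record that survives processing carries its provenance and an evaluation score forward into the analysis stage.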
Information Analysis – as the level of abstraction of the data increases, the relationships between data items become increasingly important:
- Medium Angle Processing: information extraction based largely on experience with, or statistical analysis of, previous examples of the same data stream; including statistical pattern recognition and machine intelligence;
- Wide Angle Processing: use of multiple data streams, together with wider “world-view” and historical contexts, to provide a knowledge rich assessment of the unfolding situation.
Dissemination and Decision Support – the final, feedback element in the cycle, where decisions are facilitated, actions taken, and new data requirements, planning and direction initiated.
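The closed loop described above can be sketched in a few lines of Python. Every stage here is a stub – the function names and return values are purely illustrative – but the shape shows the key point: dissemination feeds new data requirements back into planning, so the process is a cycle, not a conveyor belt:

```python
# Stub stages - placeholders for real implementations (all names illustrative)
def plan_and_direct(reqs): return {"sources": reqs}
def collect(plan): return [f"data-for-{r}" for r in plan["sources"]]
def process(raw): return raw.upper()
def analyse(records): return {"summary": records}
def disseminate(assessment): return {"action": "act", "gaps": ["next-gap"]}
def new_requirements(decision): return decision["gaps"]

def intelligence_cycle(requirements, max_iterations=3):
    """One closed loop per iteration: decisions feed back into requirements."""
    decision = None
    for _ in range(max_iterations):
        plan = plan_and_direct(requirements)       # Planning and Direction
        raw = collect(plan)                        # Data Collection
        records = [process(r) for r in raw]        # Data Processing
        assessment = analyse(records)              # Information Analysis
        decision = disseminate(assessment)         # Dissemination / Decision Support
        requirements = new_requirements(decision)  # Data Requirements (feedback)
    return decision

print(intelligence_cycle(["known-unknown-1"]))
```

In a real architecture each stub would be a substantial subsystem, but the control flow – requirements in, decisions out, gaps back in – stays the same.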
Putting these elements together into a Big Data analytics architecture has made a lot of sense for us and enabled us to streamline our whole process. What's more, it's re-usable and process-oriented, so it becomes easier each time.
We place emphasis on handling uncertainty, so we use a probabilistic approach that allows us to incorporate priors, integrate multiple sources of information and produce decision support – as well as data requirements, planning and direction – in a principled manner. We have found that this fits the architecture extremely well.
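As a deliberately tiny illustration of that probabilistic approach, here is Bayes' rule for a binary hypothesis, fusing a prior with likelihoods from independent sources. The numbers are made up for the example; a real system would estimate them from the data:

```python
def posterior(prior, likelihoods):
    """Bayes' rule for a binary hypothesis H, fusing independent sources.
    likelihoods: list of (P(data|H), P(data|not H)) pairs, one per source."""
    p_h, p_not = prior, 1.0 - prior
    for l_h, l_not in likelihoods:
        p_h *= l_h      # accumulate evidence for H
        p_not *= l_not  # accumulate evidence against H
    return p_h / (p_h + p_not)

# Weak prior (e.g. from OSCINT), then two corroborating live sources
print(round(posterior(0.1, [(0.8, 0.3), (0.7, 0.4)]), 3))
```

The appeal for the cycle described above is that the same machinery handles every stage: priors come from scene-setting data, each new source updates the belief, and the residual uncertainty itself tells you which data requirement to generate next.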
Tuesday 22 April 2014
Big Data - the Big Mistakes (part three)
“I’ve been posting
about some of my hands-on experience with big data and the lessons learned. Here’s
the first of two posts drawing connections between big data and military
intelligence”
4. Big Data is So New
We’re Pretty Much Still in the Dark.
This is a comment I hear very often and it needs to be put into context. OK, I accept that Big Data in the narrow sense of "processing vast quantities of social media data" has been with us for around five years, and the wider definition as applied to business analytics only really started a decade ago.
However, defence organisations have been processing what we now call Big Data for a lot longer than that – at least 25 years and quite possibly a good deal longer. They've had plenty of time to refine their processing techniques and hone their skills on various types of Big Data. It's a mistake for Big Data scientists to ignore that experience – I'm finding very useful information in technical publications from defence organisations in the 1980s and 1990s…
Next week I’ll be commenting on the
intelligence cycle and levels of intelligence processing. Today I’m
concentrating on data sources. Here are some of the military intelligence sources
and what they might mean for modern Big Data:
- OSCINT – the military acronym for Open SourCe INTelligence. Encyclopaedic documents, government reports (such as budgets and census data) and geographic databases are excellent sources of “scene-setting” data – often useful in defining a meaningful prior distribution before processing live information (more on this subject in a few weeks’ time);
- TRAFINT – TRAFfic INTelligence. Knowing how much information is flowing through a particular channel is often the first indication of something interesting. Correlating events of interest with traffic flow is a simple early warning technique, particularly when combined with keyword filtering. It takes a lot less compute resource to count messages than to understand their content;
- COMINT – COMmunications INTelligence. The current mainstay of Big Data is social media analysis; this would be classified as COMINT in a military context. As well as understanding the content of the communication, the value of other aspects is often overlooked. These include: who is communicating, their audience, their location, the timing of the message, how frequently the person communicates and what the influence of their communications is;
- IMINT – IMage INTelligence. Most military intelligence processing is image-related. There are over 1 billion digital-camera-equipped smartphones in regular use, and there's a vast, untapped ocean of image data publicly available on the internet. However, because processing it requires a wide data pipe and plenty of compute power, it hasn't yet become a mainstream part of Big Data processing.
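To illustrate the TRAFINT point – that counting messages is far cheaper than understanding them – here's a minimal sliding-window rate monitor in Python. The window size, threshold and crude long-run baseline are all illustrative choices:

```python
from collections import deque

class TrafficMonitor:
    """TRAFINT-style early warning: flag when the message rate in a sliding
    window exceeds a multiple of the long-run average rate."""
    def __init__(self, window=60, threshold=3.0):
        self.times = deque()        # timestamps inside the current window
        self.window = window        # window length in seconds
        self.threshold = threshold  # alarm at N x baseline rate
        self.total = 0
        self.start = None

    def message(self, t):
        """Record one message at time t; return True if the rate looks anomalous."""
        if self.start is None:
            self.start = t
        self.total += 1
        self.times.append(t)
        # Drop timestamps that have fallen out of the window
        while self.times and self.times[0] < t - self.window:
            self.times.popleft()
        # Baseline: long-run average count expected per window
        elapsed = max(t - self.start, self.window)
        baseline = self.total / elapsed * self.window
        return len(self.times) > self.threshold * baseline

mon = TrafficMonitor(window=10, threshold=2.0)
quiet = [mon.message(t) for t in range(0, 100, 5)]            # steady trickle
burst = [mon.message(100 + i * 0.1) for i in range(50)]       # sudden flood
print(any(quiet), any(burst))
```

Nothing here parses a single message body, yet the monitor still gives the "first indication of something interesting" the bullet above describes – content analysis can then be spent only on the flagged window.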
Hot Tip: Image analysis is a tiny part of Big
Data at the moment, but its golden age is coming! I predict that within 5 years
most of the Big Data processing load will be devoted to information extraction
from users’ digital images.
Thursday 17 April 2014
Big Data - the Big Mistakes (part two)
“Last week I started
a series of posts drawing from some of my hands-on experience with big data and
the lessons learned. Here’s some more food for thought”
3. Big Data is too
big to eye-ball.
Data changes all the time: the formats, the syntax, the content and the meaning. So, when performing regular data analytics there's no substitute for the Mk. I Eyeball – any data scientist worth their salary will take a look at the raw data before designing an analytics strategy, and at the processed data on a regular basis while the strategy is in operation, designing and implementing updates as necessary.
But you can't do this with Big Data – there's just too much of it – so you'll just have to trust your data and press on, right? Wrong! Big data is far more variable and can change even faster than small data, so it's essential that it gets at least as much attention. How do you eyeball that quantity of data? (I just checked, and the modest Big Data social media file I'm currently analysing for a client clocked in at 971 MB – I really don't have the time to eyeball that amount of data on a regular basis!)
We use an internally developed package that we fondly refer to as the "Mk. I Cybe-all", but the principles are applicable to everyone:
- Randomly select some of the data and output to a text file so the Mk. I Eyeball gets to see at least some of the data;
- If a format or syntax is present, check each record using regular expressions and output any that don’t conform;
- Build simple statistics of the data – record size, field size and character distributions are applicable to most data; output records that are statistically anomalous;
- Constantly recalculate these statistics and look for movement;
- If your analytics uses more complex statistics, regularly process data offline – again looking for data records that don’t fit;
- Visualize the data – use simple clustering techniques to get similar data clumped together and colour to illustrate the basic statistics;
- Analyse the data at more than one level of abstraction and allow the data scientist to drill down to get detail-on-demand.
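The first three of these checks are easy to sketch in Python – random sampling, regex conformance and simple record-size statistics. The CSV-like pattern and the 3-sigma outlier threshold are illustrative, not what our own "Mk. I Cybe-all" uses:

```python
import random
import re
import statistics

def eyeball_checks(records, pattern=r"^\w+,\d+$", sample_size=5, z=3.0):
    """Return a random sample for the Mk. I Eyeball, records that fail the
    expected format, and records whose size is statistically anomalous."""
    # 1. Random selection, so a human sees at least some of the data
    sample = random.sample(records, min(sample_size, len(records)))
    # 2. Format/syntax conformance via regular expression
    non_conforming = [r for r in records if not re.match(pattern, r)]
    # 3. Simple statistics: flag records whose length is > z standard
    #    deviations from the mean
    sizes = [len(r) for r in records]
    mean, sd = statistics.mean(sizes), statistics.pstdev(sizes)
    outliers = [r for r in records if sd and abs(len(r) - mean) > z * sd]
    return sample, non_conforming, outliers

data = ["alice,42", "bob,7", "carol,99", "not a valid record!!", "dave,13"]
_, bad, _ = eyeball_checks(data)
print(bad)  # ['not a valid record!!']
```

Run continuously against a live stream (with the statistics recalculated as new data arrives), this handful of lines already covers the "constantly recalculate and look for movement" step as well.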
Finally, once you have your own "Mk. I Cybe-all", consider its value for your clients. We've been surprised several times to find that, actually, the functionality of the "Mk. I Cybe-all" was all the customer really needed…
Monday 7 April 2014
Big Data - the Big Mistakes (part one)
“Over the next few
weeks I'll be posting some of my hands-on experience with big data and lessons
learned. I'm kicking off today with two common and closely connected mistakes”
1. Big Data is
valuable, so bigger data is more valuable.
Focus too closely on the "how much" and you risk losing sight of the "why". Big data is a raw material and, like any raw material, needs refining and processing into a valuable and relevant end product. Don't build your big data policy by starting with the data and then working out what cloud resources you need or how it might fit into the Hadoop framework. Define the end product – the decisions you need to make – and work backwards to find out what data you need and how much of it is required. It's often surprising how little high-quality information is required to make important decisions (small data / big decisions). Remember: handling big data involves a cost – it isn't free. This cost needs to be factored into the net value-add for the final decision.
Analogy: drinking large quantities of seawater is a bad idea – the more you drink, the worse the situation gets. Better to define the end product (hydration) and process just enough seawater to produce a drink that will do the job.
2. Big Data can be
used to avoid uncertainty.
No decision maker is likely to view uncertainty as a friend, and data often comes with loads of the stuff – so what's the best approach to dealing with it? I most commonly see two approaches, both of which lie somewhere on the spectrum from suboptimal to disastrous. I call them the "ostrich" and the "perfectionist". The ostrich approach is to pretend there's no uncertainty and just make the decision as if complete information were available. The problem here is that, having made the decision, people tend to use increasingly elaborate explanations of why subsequent data isn't as expected but the decision is still OK. The perfectionist causes a more subtle problem: they reduce uncertainty by waiting for (and processing) more data. Not only does this cost more (see mistake #1), it also makes the decision less timely and possibly less relevant. Factoring in timeliness is a key (and often ignored) part of a good decision – remember that the world doesn't stop while you're thinking.
Analogy: you need to walk across a busy concourse full of people. Better to make a short-term decision about where nearby people might be in the next few seconds, start walking, then quickly collect more data and decide again. Not a good idea to observe the crowd for ten minutes, calculate an optimum route, then close your eyes and go for it!