What if My Data Quality Is Not Good Enough for Analytics or Situational Intelligence?


Spreadsheet 01


You may feel that the quality of your data is insufficient for driving decisions and actions with analytics or situational intelligence solutions. Or you may know for a fact that there are data quality issues with some or all of your data. Either way, you may be inclined to delay an analytics or situational intelligence implementation until you complete a data quality improvement project.

However, consider not only the cost of delaying the benefits and value of analytics, but also that you can move forward with your current data and achieve early and ongoing successes. Data quality and analytics projects can be done together or in parallel.

“How?” you ask. Consider these points:

  • Some analytics identify anomalies and irregularities in the input data. This, in turn, helps you in your efforts to cleanse your data.
  • Some analytics, whether in a point solution or within a situational intelligence solution, recognize and disregard anomalous data. In other words, data that is suspect or blatantly erroneous will not be used, so the output and results will not be skewed or tainted (see the related post “The Relationship Between Analytics and Situational Intelligence”). This ability makes many data quality issues moot.
  • It is a best practice to pilot an analytics solution prior to actual production use. This allows you to review and validate the output and results of analytics before widespread implementation and adoption. Pilot output or results that are suspect or nonsensical can then be used to trace irregularities in the input data. This process can play an integral part in cleansing your data.
  • Some analytics not only identify data quality issues but also calculate a data quality score that relates to the accuracy and confidence of the output and results of the analytics. End users can therefore apply judgment about if and how to use the output, results, recommendations, and so on. Results with low data quality scores point to where data quality can and should be improved.
  • Visualization is a powerful tool within analytics for spotting erroneous data. Errors and outliers that are buried in tables of data stand out when placed in a chart, map or other intuitive visualization.
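The anomaly screening and scoring described above can be sketched in a few lines. This is a minimal illustration, not any particular product's implementation: readings outside a plausible range are set aside rather than fed into the analytics, and the share of usable readings serves as a simple data quality score. The function name and thresholds are hypothetical.

```python
def screen_readings(readings, low, high):
    """Separate plausible readings from suspect ones and score data quality.

    Suspect values (outside [low, high]) are excluded so they cannot
    skew downstream results; the score is the fraction of usable data.
    """
    good = [r for r in readings if low <= r <= high]
    suspect = [r for r in readings if not (low <= r <= high)]
    score = len(good) / len(readings) if readings else 0.0
    return good, suspect, score

# Example: meter readings with an obvious sentinel error and a wild spike
good, suspect, score = screen_readings([230, 232, 231, -9999, 229, 5000], 0, 1000)
print(good)             # [230, 232, 231, 229]
print(suspect)          # [-9999, 5000]
print(round(score, 2))  # 0.67
```

A low score flags where cleansing effort will pay off, while the suspect list itself points at the records worth investigating.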

You may be pleasantly surprised at how much success you can achieve using data that has not been reviewed, scrubbed or cleansed. So set aside the fear that your analytics or situational intelligence implementation will fail or have limited success if you do not first resolve data quality issues.

Instead, flip such thinking around and use analytics as one of the methods to review and rectify data quality. In other words, integrating analytics into your efforts to assess and cleanse your data is a great way to leverage your investment in analytics and get started sooner rather than later.

What are you waiting for? Get started exploring and deriving value from your data, no matter the status of its quality.


Analytics 2016: A Look Ahead


At the end of 2013, Gartner predicted that by 2017, one in five market leaders would cede dominance to a company founded after 2000 because they were unable to take full advantage of the data around them. They also predicted that by 2016 one in four market leaders would lose massive market share due to digital business incompetence.

Has this come to pass?

Gartner may not be prophetic, but it looks like they at least identified a major trend. So what will you need to take full advantage of big data in 2016 and remain, or become, a leader?

How much data are we talking about?

Big Data is a well-worn term by now, but how much data is big? Jeremy Stanley, the CTO of Collective, cites this IDC graph in his presentation “The Rise of the Data Scientist.”

Spaur 2016 Look Ahead 01

By comparing points on the curve corresponding to 2016 and 2017, we can see that about 2,000 exabytes of data will be added to the digital universe in 2016.

2,000 exabytes equals two trillion gigabytes. This is roughly equivalent to the entire digital universe in 2012. Or, for a silly spatial comparison: if 2,000 exabytes were divided among one-inch-long, one-gigabyte USB sticks, those sticks would stretch 31.6 million miles, reaching nearly from Earth to Venus.
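The back-of-the-envelope arithmetic checks out; here is a quick sketch of the conversion:

```python
EXABYTE_IN_GB = 1_000_000_000        # 1 exabyte = one billion gigabytes
gigabytes = 2_000 * EXABYTE_IN_GB    # 2,000 exabytes = two trillion GB

inches = gigabytes                   # one 1 GB stick per inch of chain
miles = inches / 63_360              # 12 in/ft * 5,280 ft/mi

print(f"{miles / 1e6:.1f} million miles")  # 31.6 million miles
```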

Where will all this data come from?

In a blog post from microcontroller manufacturer Atmel, we can see that approximately five billion connected devices will join the Internet of Things in 2016.

Spaur 2016 Look Ahead 02

This is based on a consensus estimate drawn from firms such as Cisco, Ericsson, Gartner, IDC and Harbor Research.

Like the projected growth in data, the 2016 growth in connected devices will equal the entire universe of devices just a few years ago.

What kinds of data will we see?

It’s worth reflecting on what types of data IoT devices generate, because the types of data influence the types of analytics. Those additional five billion devices will:

    • allow manufacturers to follow their products through the supply network to end consumers
    • communicate when and where they are being used and how often
    • communicate when they need to be refilled, replenished, repaired or replaced
    • alert when they are operating under distress and may fail
    • provide transportation and logistics operators with more granularity in managing their cargo and fleets
    • provide convenience to the people who deployed them (such as automatically adjusting the thermostat to a comfortable temperature when a person is within 15 minutes of home)

Because devices are connected and communicating, they deliver a stream of data. This becomes time series data, since when data is recorded, sent and received yields useful insight into the data itself and the people and activities that generate it.

Because devices are out in the world and not trapped on a desktop or in a data center, their location matters. This becomes GIS data, since where a device is, what it’s near, and what it’s connected to on a network yields useful insight into the data itself and the people and activities that generate it.
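Put together, a single IoT observation carries a value, a timestamp and a location, which is what makes both the time series and the GIS views possible. A minimal sketch of such a record (the field names and readings are illustrative, not from any particular platform):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Observation:
    device_id: str
    recorded_at: datetime  # the "when" -> time series analytics
    lat: float             # the "where" -> GIS analytics
    lon: float
    value: float

obs = [
    Observation("therm-42", datetime(2016, 1, 1, 8, 0, tzinfo=timezone.utc), 40.71, -74.01, 68.0),
    Observation("therm-42", datetime(2016, 1, 1, 8, 15, tzinfo=timezone.utc), 40.71, -74.01, 70.5),
]

# Time series view: interval between consecutive readings
interval = obs[1].recorded_at - obs[0].recorded_at
print(interval.total_seconds() / 60)  # 15.0 (minutes)

# GIS view: last known position of the device
print((obs[-1].lat, obs[-1].lon))  # (40.71, -74.01)
```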

Time series and GIS data require new repositories and analytics that many organizations don’t yet have. This will become a challenge for companies in 2016. (The implications of new data types are a big topic that we’ll be exploring further in 2016.)

How will we handle and analyze all that data?

In his book The Singularity Is Near, Ray Kurzweil argues that we’re drawing close to the point when a $1,000 computer will have the same cognitive power as a human brain.

Spaur 2016 Look Ahead 03

In 2016, a $1,000 computer will surpass a mouse brain. (You didn’t realize that a mouse brain does so much, did you?) The $1,000 human brain is just a couple of years away at current rates.

We’re already at the point where, for many analytical tasks, we require computerized brains to do our heavy data integration and computational work. Think of weather modeling or financial markets or piloting aircraft and spacecraft.

What software will run on these more powerful computers? A Forbes article by Louis Columbus summarizes trends in big data analytics through 2020, including this graph:

Spaur 2016 Look Ahead 04a

In 2016, the big data analytics market will grow by approximately $1 billion across five main categories: real-time analysis and decision-making, precise marketing, operational efficiency, business model innovation, and enhanced customer experience. The analysis of transactional, time-series and GIS data applies across all five domains.

Are you ready for 2016?

Like Gartner, these other studies are not necessarily prophetic, but they do point to the overall trend. The opportunity awaits in 2016 for you to apply increasingly affordable computing and analytics power to correlating, analyzing and visualizing new types of data to generate new insights, new opportunities and new revenues, thereby avoiding the fate of the eclipsed market leaders Gartner warned about.

How do you take advantage of this opportunity in 2016?

Start small and move fast in testing use cases of data-driven changes that make an impact in your operational efficiency and your relationships with prospects and customers. That covers three of the five categories of analytics listed in the Forbes article. Increased operational efficiency generates savings that you can apply to further data-driven initiatives. Improved relationships with prospects and customers increase top-line revenues and bring you market visibility. With increased top-line revenues and bottom-line savings, you’re on your way to data-driven business improvement.

Why do you need to do this? Your customers expect it and your shareholders require it, mainly because your competitors are already doing it.

(Ron Stein contributed to this post)


The Science of Visualization: Organizing Streaming Analytics


How do you visualize a lot of things happening all at once? With all the hoopla about real-time analytics, the Internet of Things and cloud computing, how could you distill an actionable story from all the bytes floating around us?

For example, consider audience reactions during presidential debates. During past debates, CNN has displayed something called the People Meter, which consists mainly of a running line chart tracking real-time changes in audience approval based on candidate statements. (Search the Internet for “CNN People Meter” to learn more.)

Tinklepaugh CNN People Meter

In this screen capture from a CNN presidential debate, we watch how audience perception fluctuates by gender, income level and political affiliation as candidates address various topics. CNN was measuring the perception of the crowd, but that perception did not reflect the sentiment of a population for very long once the topic changed. It was a high-level aggregate of perceptions relating to various topics in the debate.

For the upcoming presidential primary season, let’s say that a candidate’s campaign staff wants to be more savvy and measure sentiment in individual states on a given issue or word. Then they could guide their candidate with suggestions for words to use not just in debates but throughout the campaign. This would allow the staff to identify, in real time, which words create the largest positive spikes in sentiment across all states while minimizing negative spikes. The end result would be their candidate connecting with more people than their opponent.

Recently my friend and data scientist Zaid Tashman proposed a way a campaign might accomplish this. He drew some time series charts attached to a dendrogram (tree) and described how it allows users to quickly compare many lines of data by seeing differences of distance within and between groups.

So what would this tool look like? Tashman suggested using sparklines, colored by cluster, as the time series charts alongside the dendrogram. After he explained it, I made this prototype user interface that allows the user to perceptually discern groups with similar responses to the hashtag #syria used across multiple social media platforms (Facebook, Twitter, Instagram, LinkedIn, Google+). I also followed the advice of a visual designer I know, Bob Farrigan, and included a choropleth map.

Tinkelpaugh dendogram

Together, the dendrogram and the states would update in real time to reflect usage of the hashtag #syria during the presidential primaries.

The problem, of course, is that only posts that include location data from the user can be considered. Still, this approach should yield more insight for campaign staff than the CNN People Meter.
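The grouping that the dendrogram visualizes can come from hierarchical (agglomerative) clustering of the per-state time series. Below is a minimal pure-Python sketch of single-linkage clustering; a real implementation would typically use a library such as SciPy, and the state sentiment series here are made up for illustration:

```python
def euclidean(a, b):
    """Distance between two equally sampled time series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(series, k):
    """Agglomerate series into k clusters by repeatedly merging the
    two clusters whose closest members are nearest each other."""
    clusters = [[name] for name in series]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(series[a], series[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Hypothetical hourly #syria sentiment per state
sentiment = {
    "OH": [1, 2, 3], "PA": [1, 2, 4],
    "TX": [9, 8, 7], "FL": [9, 8, 6],
}
print(sorted(sorted(c) for c in single_linkage(sentiment, 2)))
# [['FL', 'TX'], ['OH', 'PA']]
```

The merge order recorded by such a procedure is exactly what a dendrogram draws: states that merge early sit close together in the tree.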

Decreasing Saccadic Eye Movements

Scientists and designers may ask why I put the dendrogram on the right when it traditionally goes on the left. This was my attempt to decrease visual scanning (eye saccades).

Tinklepaugh visual scan

The dendrogram becomes one more thing my brain must process as I learn how the user interface elements inform each other. If the dendrogram is on the left, then all I’m doing is considering abstract distances without context. Putting it on the right decreases visual intake and processing, allowing users to understand the data faster.

At a neurological level, another reason to put the dendrogram on the right is that it changes over time. The faint grey lines cue the user where to look. Experiments by Ogden and Niessen in 1978 showed that a cue, once shown, makes users faster at moving their eyes to the target.

Also, we read timelines and sentences from left to right, so I arranged the information from left to right. As I scan left to right through states, sparklines and dendrogram, the information that maximizes understanding is presented first.

A Question for You

If you were the campaign worker, would you propose this hashtag tool to your candidate?