
Data Fusion Techniques: Breaking down the fusion confusion

25 January 2022 | Pureprofile

By Johnny Caldwell, FMRS, Senior Director – Partnerships EMEA & US

It might sound a bit scary to the uninitiated, a technique reserved for data scientists and the Googles and Metas of the analytics world, but quite simply, in quantitative research we fuse data all the time.

Think of the frequency with which we apply weighting to our data sets using pre-accepted variables, segmentation or cluster analysis in order to produce more accurate, representative outputs.

A popular and commonplace example is when we apply nationally representative (nat rep) weighting to a project’s results. The specific market’s nat rep figures (the weighting matrix) are accepted as true, as they are derived from official governmental Office for National Statistics figures. The matrix is applied and we re-align the emphasis of the demographic groups under-represented in our sample in order to produce a truly representative final data set.
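As a minimal sketch of that re-weighting step (the data, column names and target proportions below are invented for illustration; real projects use full ONS-derived matrices and more sophisticated raking), the logic is simply target proportion divided by achieved sample proportion per demographic cell:

```python
import pandas as pd

# Illustrative survey responses; 'age_group' is the weighting variable (hypothetical data)
survey = pd.DataFrame({
    "respondent_id": range(1, 9),
    "age_group": ["18-34", "18-34", "18-34", "35-54", "35-54", "55+", "55+", "55+"],
    "brand_awareness": [1, 0, 1, 1, 0, 0, 1, 1],
})

# Nat rep targets, i.e. population proportions of the kind derived from ONS figures (illustrative values)
nat_rep_targets = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Weight = target proportion / achieved sample proportion for each demographic cell
sample_props = survey["age_group"].value_counts(normalize=True)
survey["weight"] = survey["age_group"].map(lambda g: nat_rep_targets[g] / sample_props[g])

# The weighted estimate now reflects the national profile rather than the raw sample
weighted_awareness = (survey["brand_awareness"] * survey["weight"]).sum() / survey["weight"].sum()
print(f"Weighted brand awareness: {weighted_awareness:.1%}")
```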

Consequently, Data Fusion can be explained as the integration of multiple data sources in order to produce more accurate and consistent outputs than would be provided by any of the original individual sources on their own.

Looking beyond the original data

If we go up a level, we might want to start layering the data and enhancing it with other useful information obtained outside of the original survey.

Obviously, if you own or have access to a traditional double opt-in panel, this is in many ways easily achievable by simply building a data bank of behaviours on the audience as a whole or a subset thereof. However, it must be noted that the sample has to live in a well-curated traditional panel environment; programmatic or river sampling, by its transitory nature, just won’t work.

Think of the vast number of screening questions traditional panels ask. These are asked not just during the initial recruitment process but continuously throughout the life of a panellist, in order to establish incidence rates and improve targeting. Because we know exactly who each individual respondent is, and we’re not revealing any PII (Personally Identifiable Information), this information can easily be appended to survey data and so enhance the output.
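In practice, that appending is essentially a key-based join on an internal panellist ID. A minimal sketch is below; the IDs, profile fields and values are invented purely for illustration:

```python
import pandas as pd

# Survey responses keyed by an internal panellist ID (no PII exposed)
survey = pd.DataFrame({
    "panellist_id": [101, 102, 103],
    "q1_purchase_intent": [4, 2, 5],
})

# Profile data already held on the panel, e.g. from earlier screeners (illustrative fields)
panel_profile = pd.DataFrame({
    "panellist_id": [101, 102, 103, 104],
    "lifestyle_segment": ["Urban professional", "Family focused", "Retired", "Student"],
    "owns_ev": [True, False, False, True],
})

# Append the stored profile variables to the survey output by joining on the shared ID
enhanced = survey.merge(panel_profile, on="panellist_id", how="left")
print(enhanced)
```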

The data sets are endless

We’re not just talking about demographics here. Many clients do, however, request that these be supplied, allowing for a reduced survey length and consequently additional time to concentrate on more intrinsic, subject-related questioning.

The options are numerous and depending on exactly what the panel owner has previously asked they can include:

  • Psychographic and lifestyle data
  • Service and product ownership
  • Environmental and ethical opinion
  • Political and future voting persuasion etc.

Up another level, and we can source more passive categories of data from our panel audience by utilising a fair data exchange methodology. Completely GDPR compliant and strictly with the full permission of our respondents, there already exist many technologies that allow us to obtain browsing, app usage, transactional and geo-tracking data which, again, because we know exactly who the originator is, can be appended to any related survey data.

But what happens if the data is from an external source?

In other words, data not produced by the same individuals who answered our survey. Fear not: as long as we have a decent number of identical data variable keys, such as full demographics or psychographics, the supplementary information can be fused with our original survey data set and still produce extremely useful, actionable insights that might not have been possible to obtain before.
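One simple way to picture this is nearest-neighbour statistical matching: each survey respondent (the recipient) is paired with the most similar record in the external data set (the donor) on the shared key variables, and the donor’s extra variables are carried across. The sketch below assumes both sets share coded age and social-grade keys; all field names and values are illustrative, and real fusions use more keys and more careful matching:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Recipient set: our survey, with the shared key variables coded numerically
survey = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "age_band": [2, 4, 1],        # shared key (coded)
    "social_grade": [1, 3, 2],    # shared key (coded)
})

# Donor set: external data (e.g. a media-behaviour study) with the same keys plus extra variables
external = pd.DataFrame({
    "age_band": [2, 4, 1, 3],
    "social_grade": [1, 2, 2, 3],
    "daily_radio_minutes": [35, 80, 15, 50],   # the variable we want to fuse in
})

keys = ["age_band", "social_grade"]

# Find, for each survey respondent, the closest donor on the shared keys
nn = NearestNeighbors(n_neighbors=1).fit(external[keys])
_, donor_idx = nn.kneighbors(survey[keys])

# Carry the donor's extra variable across to the matched survey record
survey["daily_radio_minutes"] = external["daily_radio_minutes"].iloc[donor_idx.ravel()].values
print(survey)
```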

A great example of this is if you were lucky enough to obtain current ‘Touchpoints’ data produced by the IPA (The Institute of Practitioners in Advertising). This consumer behaviour database was created to meet the needs of the communications industry and offers unique insights into daily life and media usage across the United Kingdom.

The beauty of a data set like this is that it is designed by market research practitioners and so contains a lot of the familiar data keys common throughout the majority of online surveys meaning that both cohorts can easily be fused and again produce a great resource rich in granular detail. Also take a look at the TGI Survey (Target Group Index) originally created in 1969 by BMRB and now facilitated by Kantar.

Tackling the Big Data beast

Now up yet another level, and this is where things start to become a bit more problematic and we have to take a completely different view. Let’s look at Big Data and how it can provide a more sophisticated model by pulling in intelligence derived from more disparate origins.

The subject, like the data, is vast and far too intricate for me to give it any explanatory justice here. We are producing more of this type of data than ever and it’s growing day by day.

However, it very rarely comes with a neat set of variables that market researchers can work with; we don’t often see full demographics. More likely the data is unstructured with no defining rules (a tweet, a photo, a video), and this is where AI, Machine Learning and similar technologies come in to help us understand exactly what it means.
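As a toy illustration of that step (the posts, labels and model choice are invented; a real pipeline would use far more data and most likely a pre-trained language model), the point is that ML turns unstructured text into a structured variable that can then be fused like any other:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A handful of labelled posts used to train a very simple sentiment classifier (toy data)
train_text = [
    "love the new electric car range",
    "terrible delivery service again",
    "great value broadband deal",
    "worst customer support ever",
]
train_label = ["positive", "negative", "positive", "negative"]

# TF-IDF features plus logistic regression: a deliberately simple stand-in for heavier AI/ML tooling
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_text, train_label)

# New unstructured posts become a structured 'sentiment' field we could append elsewhere
new_posts = ["really happy with the upgrade", "support chat was useless"]
print(list(zip(new_posts, model.predict(new_posts))))
```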

Having said that, there are a host of other Big data sources that do offer some limited degree of structure. With the right systems in place, and analysed on their own, you are looking for patterns in the connections that form, made more reliable by the sheer volume of information you have to hand.

You can create at least some accord by attempting to source this type of semi-structured Big data. You might be able to get hold of wider, higher-level unique identifiers such as regional postcodes, location by Output Area or Super Output Area level, voting history, product purchase, geo-location etc. The fact that the data is so vast means that its veracity is statistically more reliable, and you can at least fuse it with your own survey data and see what it throws up, as the sketch below illustrates.
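One practical pattern, sketched here with invented fields and values, is to aggregate both sides to the same higher-level identifier, say postcode district, and fuse at that level rather than at the individual level:

```python
import pandas as pd

# Survey data carrying a postcode district for each respondent (illustrative)
survey = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4],
    "postcode_district": ["SW1", "SW1", "M1", "M1"],
    "intends_to_switch": [1, 0, 1, 1],
})

# Semi-structured big data source already aggregated to the same geography (illustrative values)
footfall = pd.DataFrame({
    "postcode_district": ["SW1", "M1"],
    "avg_weekly_store_visits": [3.2, 1.4],
})

# Aggregate the survey to district level, then fuse with the big data source on the shared key
district_view = (
    survey.groupby("postcode_district", as_index=False)["intends_to_switch"].mean()
          .merge(footfall, on="postcode_district", how="left")
)
print(district_view)
```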

It’s always better to reverse engineer rather than retrofit

Try finding an interesting Big data source you’d like to use first, before writing your questionnaire, then include the relevant data key variables across the board for an easier match.

Additionally, and again depending on what relational database platforms and systems you have to hand, start small, ideally with the most recently generated batch of data.

That all said, remember to always try and obtain the most up-to-date data set you can get hold of, and always be aware of the four V’s in data science: Volume (what’s the scale?), Velocity (is it real time?), Veracity (how trustworthy is it?) and Variety (how many variables?).
