Like an Echo Chamber

We gather a large part, if not the major part, of our information from the internet. News portals and social media are only a mouse click away and provide the information we think we need. This is a very positive side of the internet. A downside, however, is that under certain circumstances people may find themselves in echo chambers, where like-minded people group together and listen to one-directional information and arguments that merely replicate and confirm already existing knowledge, views, and opinions.

At the heart of most big data analytics projects lies the specification of predictive statistical models. A situation often encountered is that a model fits the data set used to train it very well, yet delivers much lower quality when applied to new data. This phenomenon is called over-fitting: the model can only replicate and confirm data it has already seen and cannot deal with new data. It does not generalize and is therefore useless for prediction.
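To make this concrete, here is a minimal sketch in Python; the data and the degree-12 polynomial are illustrative assumptions, not taken from any real project. An over-flexible model scores almost perfectly on its own training points but much worse on fresh data drawn from the same underlying process:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a simple underlying relationship y = 2x + noise.
x_train = rng.uniform(0, 10, 15)
y_train = 2.0 * x_train + rng.normal(0, 3, 15)

# A degree-12 polynomial has enough freedom to chase the noise;
# a degree-1 fit matches the true structure of the data.
overfit = np.polynomial.Polynomial.fit(x_train, y_train, deg=12)
honest = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)

# Fresh data from the same process that the models never saw.
x_new = rng.uniform(0, 10, 100)
y_new = 2.0 * x_new + rng.normal(0, 3, 100)

def mse(model, x, y):
    return np.mean((model(x) - y) ** 2)

print("degree 12, train MSE:", mse(overfit, x_train, y_train))  # typically near zero
print("degree 12, new MSE:  ", mse(overfit, x_new, y_new))      # typically far larger
print("degree 1,  train MSE:", mse(honest, x_train, y_train))
print("degree 1,  new MSE:  ", mse(honest, x_new, y_new))       # close to its train error
```

The flexible model echoes its training data back almost perfectly while the simple one keeps roughly the same error on data it has never seen.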

And this is like a statistical echo chamber.

Of course, statistical science provides techniques to mitigate over-fitting, such as feature removal or cross-validation. Such techniques, however, do not replace the fundamental requirement to use relevant, representative, and high-quality data in the modeling process. A health care data set containing only diabetes prescriptions of patients older than 55 will never yield a predictive model for diabetes product consumption across the entire population, irrespective of the statistical models used and the buzzwords around them.
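As an illustration of cross-validation, here is a sketch assuming scikit-learn and a synthetic data set standing in for real health care data: the training score of an unconstrained model looks perfect, while the cross-validated score, computed only on folds the model never saw during fitting, reveals the echo.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a real data set.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# An unconstrained decision tree can memorize the training data outright.
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X, y)
print("training R^2:", tree.score(X, y))  # 1.0 -- the model echoes its own data

# 5-fold cross-validation scores the model only on held-out folds,
# exposing how it actually handles data it has not seen.
cv_scores = cross_val_score(DecisionTreeRegressor(random_state=0), X, y, cv=5)
print("cross-validated R^2:", cv_scores.mean())  # noticeably lower
```

Cross-validation catches the problem, but it cannot repair it: if the underlying data is unrepresentative, every fold inherits the same bias.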

Investments in data quality and relevance yield good returns in model quality and relevance.