The real world, whether physical (for example, machines) or natural (for example, human and animal behaviour), is very complex: many factors, some of them unknown, determine how a system behaves and how it responds to interventions. Even if every factor contributing to a phenomenon is known, it is unrealistic to expect that the unique contribution of each factor can be isolated and quantified. Mathematical models are therefore simplified representations of reality, but to be useful they must give realistic results and reveal meaningful insights.
In his 1976 paper ‘Science and Statistics’ in the Journal of the American Statistical Association, George Box wrote: ‘Since all models are wrong, the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary, following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist, so over-elaboration and over-parameterization is often the mark of mediocrity.’ In some ways, this extends a famous saying often attributed to Einstein: ‘Everything should be made as simple as possible, but not simpler’.
The terms data mining, statistical modelling and predictive analytics are often used interchangeably, although they have different meanings; this is particularly true of data mining and statistical modelling. This document clarifies these and similar terms, and presents a glossary of the following terms:
- statistics and statistic
- mathematical models
- deterministic models
- statistical models
- data mining
Statistics and Statistic
The word statistics derives from stato, the Italian word for state. The original aim of statistics was the collection of information for and about the state.
Statistics as we know it today was born in the mid-17th century, when John Graunt, a shopkeeper from London, began reviewing a weekly church publication, issued by the local parish clerk, that listed the number of births, christenings and deaths in each parish. These records were called the Bills of Mortality, and Graunt published his analysis of them, in a form that we now call descriptive statistics, in Natural and Political Observations Made upon the Bills of Mortality. Graunt was later elected to the Royal Society.
The word statistics can be used in the singular and the plural.
- The singular form is used when referring to the academic subject. For example, statistics is offered as a part of many university mathematics degrees. One definition of the subject statistics is ‘the science of collecting, organising and interpreting data’. The data analysed are usually a sample obtained from surveys, experiments or periodic snapshots, and the aim is to infer information about the population from which the sample was drawn. When the entire population, rather than a sample from it, is analysed, the exercise is called a census.
- The plural form refers to at least two quantities. For example, the mean and standard deviation are two summary statistics for continuous data.
The plural sense also has a singular form, statistic. For example, the mean is the most frequently used statistic for the central tendency of continuous data.
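As a small illustration of summary statistics, the sketch below computes the mean and standard deviation of a sample using Python's standard statistics module. The data values are made up purely for illustration, and the variable names are assumptions rather than anything defined in this document.

```python
import statistics

# Hypothetical sample: heights in centimetres of eight people (made-up values).
sample = [162.0, 170.5, 168.0, 175.0, 181.5, 159.0, 172.0, 166.0]

# Two summary statistics for continuous data.
mean = statistics.mean(sample)        # central tendency
sample_sd = statistics.stdev(sample)  # spread, using the n - 1 divisor for a sample

# If the eight values were the entire population (a census), the population
# standard deviation, which divides by n, would be the appropriate choice.
population_sd = statistics.pstdev(sample)

print(f"mean = {mean:.2f} cm")
print(f"sample standard deviation = {sample_sd:.2f} cm")
print(f"population standard deviation = {population_sd:.2f} cm")
```

The distinction between stdev and pstdev mirrors the distinction between analysing a sample and analysing the whole population.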
Mathematical Models
A mathematical model is a description of a system using mathematical concepts, for example algebra, graphs, equations and functions, and mathematical language, for example arithmetic signs.
Mathematical modelling is the process of developing mathematical models.
Deterministic Models and Statistical Models
Mathematical models can be classified in a number of ways, for example as deterministic models or as statistical (probabilistic) models.
- A deterministic model is a mathematical model in which the output is determined only by the specified values of the input data and the initial conditions. This means that a given set of input data will always generate the same output.
- A statistical model is a mathematical model in which some or all of the input data have some randomness, for example as expressed by a probability distribution, so that for a given set of input data the output is not a single reproducible value but is itself described by a probability distribution. The output ensemble is obtained by running the model a large number of times, with a new input value sampled from the probability distribution each time the model is run. This procedure is known as Monte Carlo simulation.
So, another definition of a statistical model is a mathematical description of a system that accounts for uncertainty in the system.
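The difference between the two kinds of model can be sketched in a few lines of Python. The example below is a minimal illustration, not a recommended implementation: the travel-time model, the function names, the normal distribution for speed and all numerical values are assumptions chosen for the example.

```python
import random
import statistics


def deterministic_model(distance_km, speed_kmh):
    """Travel time in hours: the same inputs always give the same output."""
    return distance_km / speed_kmh


def statistical_model(distance_km, mean_speed_kmh, sd_speed_kmh, n_runs, seed=0):
    """Monte Carlo simulation: sample a new speed for every run of the model."""
    # The seed is fixed only so that this example is repeatable.
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_runs):
        speed = rng.gauss(mean_speed_kmh, sd_speed_kmh)
        speed = max(speed, 1.0)  # guard against non-physical sampled speeds
        outputs.append(deterministic_model(distance_km, speed))
    return outputs


# Deterministic: a single answer.
fixed = deterministic_model(120.0, 60.0)  # exactly 2.0 hours

# Statistical: an ensemble of answers described by a distribution.
ensemble = statistical_model(120.0, 60.0, 5.0, n_runs=10_000)

print(f"deterministic travel time: {fixed:.2f} h")
print(f"simulated mean travel time: {statistics.mean(ensemble):.2f} h")
print(f"simulated standard deviation: {statistics.stdev(ensemble):.2f} h")
```

The deterministic model returns one number, whereas the statistical model returns an ensemble that can be summarised by its mean, its spread or any other statistic of interest.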
Statistical modelling is the process of forming a hypothesis about a set of data, developing a statistical model that expresses the hypothesis and then testing the model on the data to see whether the data support the hypothesis.
Data Mining
Data mining is the process of analysing data to find new patterns and relationships in the data. In some ways, it is an ‘exploratory walk through the data’ with no particular objective but with an open mind as to the patterns and trends that will be revealed.
One of the main differences between data mining and statistical modelling is that data mining does not require a hypothesis for the model, whereas statistical modelling does. Thus, in statistical modelling a model is specified in advance, but in data mining no relationships are specified in advance.
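The contrast can be sketched in code. The example below is a simplified illustration under stated assumptions: the data are synthetic, the variable names (advertising, temperature, sales) are hypothetical, a permutation test stands in for whatever hypothesis test a real project would use, and a scan of pairwise correlations stands in for the much wider range of data mining methods.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Synthetic data (made up): sales depends on advertising, and
# temperature is unrelated noise.
rng = random.Random(1)
advertising = [rng.uniform(0, 10) for _ in range(200)]
temperature = [rng.uniform(5, 30) for _ in range(200)]
sales = [50 + 3 * a + rng.gauss(0, 4) for a in advertising]
data = {"advertising": advertising, "temperature": temperature, "sales": sales}


def permutation_p_value(x, y, n_perm=2000, seed=2):
    """Share of shuffled datasets whose |r| is at least the observed |r|."""
    shuffler = random.Random(seed)
    observed = abs(pearson(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        shuffler.shuffle(y_shuffled)
        if abs(pearson(x, y_shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Statistical modelling: the relationship is specified in advance
# (hypothesis: sales is associated with advertising) and then tested.
p_value = permutation_p_value(data["advertising"], data["sales"])
print(f"pre-specified hypothesis, permutation p-value: {p_value:.4f}")

# Data mining: no relationship is specified in advance; every pair of
# variables is scanned and ranked by strength of association.
ranked = sorted(
    ((abs(pearson(data[a], data[b])), a, b) for a, b in combinations(data, 2)),
    reverse=True,
)
for strength, a, b in ranked:
    print(f"{a} vs {b}: |r| = {strength:.2f}")
```

In the first half the analyst decides which relationship to examine before looking at the results; in the second half the data suggest which relationships deserve attention.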
Some Misconceptions about Data Mining
Data mining models can be complex, and so to get the maximum benefit from them users must be familiar with them and, equally important, must know their assumptions and limitations. The black box nature of some data mining software makes data mining easy to misuse, and this can lead to bad and costly business decisions. Following the earlier discussion about black box software in Analytics, it is worthwhile clarifying a few misconceptions about data mining.
Misconception 1: Data mining requires little or no human intervention.
Response 1: Data mining is a process, not an event, and so data mining projects should be carried out using a structured and robust methodology, such as CRISP-DM (Cross-Industry Standard Process for Data Mining).
The data preparation phase of CRISP-DM is not a prescriptive process that can be totally automated but a creative task that requires human intervention if only for the following reasons:
- Each set of data is unique and so has its own characteristics that determine how, and to what extent, it needs to be prepared for the analysis and modelling.
- Data preparation covers a very wide range of methods, and the particular methods used and the way they are applied depend on many things including the business aims of the project, the data and the modelling methods to be used.
Data mining and analytics software should not be used blindly, i.e. without understanding the modelling methods. People who are not familiar with the methods are likely to apply the wrong method to the wrong data and so obtain the wrong answer to the wrong question. The implications for a business of acting on results generated in this way are self-evident.
Misconception 2: Data mining software packages are intuitive and easy to use.
Response 2: Data mining is concerned with analysis and modelling, not with IT. Data mining software is not ‘plug and play’ software. Only people with knowledge and experience of the models should use it, so that it is used correctly and the maximum commercial benefit is gained from it.
Misconception 3: Data mining can identify the causes of the problem.
Response 3: Data mining reveals patterns and relationships in the data; on its own it does not identify causes. It is much more likely to help identify the causes of the problem if it is used with CRISP-DM or a similar methodology, because identifying causes requires a thorough understanding of the background and context of the problem, and of the data.