Data mining
Data mining algorithms are now applied to discover the patterns and relationships in the data. The following functions are applied to the data sets:
-
Detecting anomalies – identifies unusual data that do not follow the common pattern of the other results. Any results found to be anomalies could be incorrect and require further investigation.
-
Association rule learning – searches for relationships between the variables in the data and expresses them as if-then rules. As in machine learning, the machine uses algorithms to find the solutions; where it differs is that the rules it keeps must satisfy strict criteria set in advance, whereas other machine learning methods derive their own decision criteria from the data.
-
Clustering – the machine groups together pieces of data that have similar properties, separating them from data that lack those properties, without the groups being defined in advance.
-
Classification – the system learns a function that assigns data that hasn’t yet been labelled to one of a set of pre-defined classes. The user defines the class structure and the machine categorises the data according to the rules of that structure.
-
Regression – finds the function that models the data with the least error. It does this by estimating how the dependent variable changes when the independent variables are changed.
-
Summarisation – presents the data in a form that is more understandable to the user, for example by means of data visualisation techniques.
The patterns that have been found can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics.
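As an illustration only, two of the functions listed above, anomaly detection and regression, can be sketched in a few lines of Python. The z-score rule, the threshold of 2.0, the sample readings and the function names find_anomalies and fit_line are all assumptions made for this sketch; real data mining tools offer many other methods.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold.

    The z-score rule and the default threshold are illustrative choices.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical sample data, invented for this sketch.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
anomalies = find_anomalies(readings)
print(anomalies)  # [50]

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```

Here the reading of 50 is flagged because it lies more than two standard deviations from the mean, and the fitted line describes how the dependent variable changes with the independent one.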
Interpretation/Evaluation
The final step in the KDD process is to verify the patterns identified by the data mining algorithms. Data mining may produce results that are not valid: they don’t actually predict future behaviour and cannot be reproduced on a different data sample. The algorithms may find relationships in the sample that are not present in the general set of data. This is known as overfitting.
To guard against this, the evaluation uses a test data set on which the mining algorithm was not trained. This practice is borrowed from machine learning. The newly learned patterns are applied to this test data set and the results are compared to the expected results.
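The hold-out evaluation described above can be sketched as follows. This is a minimal example under stated assumptions: the data are synthetic, the "pattern" is a simple least-squares line, and the helper names, the 25% test fraction and the tolerance of 0.5 are all invented for illustration.

```python
import random
from statistics import mean


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Learn a pattern: least-squares line through (x, y) pairs."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_abs_error(model, rows):
    """Average absolute gap between predicted and expected results."""
    slope, intercept = model
    return mean([abs(slope * x + intercept - y) for x, y in rows])


# Synthetic data (an assumption): y = 2x + 1 plus small bounded noise.
noise = random.Random(42)
rows = [(x, 2 * x + 1 + noise.uniform(-0.5, 0.5)) for x in range(40)]

train_rows, test_rows = train_test_split(rows)
model = fit_line(train_rows)                       # learned on training data only
train_error = mean_abs_error(model, train_rows)
test_error = mean_abs_error(model, test_rows)      # data the model has not seen

TOLERANCE = 0.5  # hypothetical "expected standard"
meets_standard = test_error <= TOLERANCE
print(f"train={train_error:.3f} test={test_error:.3f} accepted={meets_standard}")
```

A test error much larger than the training error would suggest that the learned pattern is overfitted to the training sample.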
If the learned patterns do not meet expected standards, the data mining steps must be re-evaluated and the data set created in the pre-processing step must be changed.
If the learned patterns do meet expected standards, they are interpreted and turned into system knowledge.
The knowledge is then organised and presented so that it is understandable to the user, for example in a report or a graph.