Statistical data mining is the use of statistical techniques for data mining.
It involves constructing database interfaces into statistical software.
Statistics is used to detect the structure of the data and perform meaningful analysis.
On the basis of a large amount of data, one or more inferences can be drawn.
Techniques Used in Statistical Data Mining
1. LINEAR REGRESSION:
It fits the best possible linear relationship between the independent and dependent
variables and hence predicts the target variable.
This is done by minimising the distance (and hence the
error) between each actual observation and the fitted line.
Simple Linear Regression:
It uses one independent variable to predict the dependent variable.
Multiple Linear Regression:
It uses more than one independent variable to predict the dependent variable.
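A minimal sketch of multiple linear regression with NumPy, using made-up data (the numbers below are purely illustrative). The least-squares fit minimises the squared distance between each observation and the fitted plane:

```python
import numpy as np

# Hypothetical data: predict y from two independent variables x1, x2.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([5.0, 4.0, 11.0, 10.0, 15.0])  # constructed so y = 1*x1 + 2*x2

# Add an intercept column and solve the least-squares problem,
# i.e. minimise the squared error between observations and the fitted plane.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

intercept, b1, b2 = coef
print(intercept, b1, b2)
```

Because the toy data lie exactly on a plane, the recovered coefficients match the generating ones; with real, noisy data they would only approximate them.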
Classification Trees:
These classify data into categories for easier analysis.
They are also known as decision trees and are used for large data sets.
Logistic Regression:
a. It is used when the dependent variable is binary.
b. It falls under predictive analysis.
It explains the relationship between a binary dependent
variable and one or more nominal or ordinal independent variables.
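A minimal sketch of logistic regression on a binary outcome, fitted by plain gradient descent on the log-loss; the data and learning rate are made up for illustration:

```python
import math

# Toy binary data: one predictor; class 1 becomes likely as x grows.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit weight w and bias b by gradient descent on the logistic log-loss.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(w * x + b) - y   # predicted probability minus label
        gw += err * x
        gb += err
    w -= lr * gw
    b -= lr * gb

# The fitted model assigns low probability to small x, high to large x.
print(sigmoid(w * 0.5 + b), sigmoid(w * 4.0 + b))
```

In practice one would use a library fitter, but the loop above makes the "predictive analysis on a binary dependent variable" idea concrete.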
Discriminant Analysis:
One or more clusters (populations) are known a priori, and one or more
observations are classified into these populations on the basis of some characteristics.
Response classes store the predictors on which
the observations are classified, and the probability of each class is calculated by Bayes' theorem.
Linear or quadratic discriminant models can be used for this.
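A minimal one-dimensional sketch of the Bayes' theorem step: two a-priori populations are modelled as Gaussians with assumed means and a shared variance (all numbers are hypothetical), and a new observation is assigned the posterior probability of each class:

```python
import math

# Two a-priori populations with assumed means and a shared variance.
mu0, mu1, var = 1.0, 4.0, 1.0

def class_posterior(x):
    # Gaussian likelihoods for each population; with equal priors the
    # prior terms cancel, leaving Bayes' rule as a likelihood ratio.
    l0 = math.exp(-(x - mu0) ** 2 / (2 * var))
    l1 = math.exp(-(x - mu1) ** 2 / (2 * var))
    return l1 / (l0 + l1)   # posterior probability of population 1

print(class_posterior(0.8), class_posterior(4.2))
```

With a shared variance this is the linear (LDA) case; allowing each population its own variance gives the quadratic (QDA) case.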
2. RESAMPLING METHODS:
In resampling, repeated samples are drawn from the original data.
It is a non-parametric method of statistical inference,
i.e. general distribution tables are avoided.
Resampling generates a unique sampling distribution on
the basis of the actual data, using experimental rather than analytical
methods.
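A minimal sketch of one such method, the bootstrap: resampling the observed data with replacement yields an empirical sampling distribution of the mean with no distribution table assumed (the sample values are made up):

```python
import random

random.seed(0)

# Hypothetical original sample; we want the sampling distribution of its mean.
data = [2.1, 2.5, 2.8, 3.0, 3.3, 3.9, 4.2, 4.8]

boot_means = []
for _ in range(2000):
    resample = [random.choice(data) for _ in data]  # draw with replacement
    boot_means.append(sum(resample) / len(resample))

# Percentile interval from the experimentally generated distribution.
boot_means.sort()
lo, hi = boot_means[49], boot_means[1949]  # roughly a 95% interval
print(lo, hi)
```

Everything here comes from the actual data, which is exactly the "experimental rather than analytical" point made above.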
3. SUBSET SELECTION:
This approach identifies the subset of predictors believed
to bear a relation to the response.
A model is then fitted via least squares on the selected subset.
Best Subset Selection: An OLS regression is fitted for every possible
combination of the p predictors.
Stage 1: All models containing k predictors
are fitted, for each model size k up to the maximum p.
Stage 2: A single model is
selected via cross-validated prediction error.
Forward Stepwise Selection: Initially the model
contains no predictors; predictors are then added one at a time
until all predictors are exhausted. At each step, the variable giving the greatest
improvement on being added is chosen, and the final model is selected via cross-validated prediction error.
Backward Stepwise Selection: Initially,
all p predictors are present in the model, after which the least
useful predictors are removed one by one.
Hybrid Methods: These are
similar to forward stepwise selection; however, after every new
variable is added, variables that no longer contribute to the model may be removed.
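A minimal sketch of the greedy step in forward stepwise selection, on synthetic data where only two of three candidate predictors actually drive the response; at each step the predictor giving the largest drop in RSS is added (the final choice among model sizes by cross-validation is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: three candidates, but only x0 and x1 drive the response.
n = 60
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

def rss(cols):
    # Fit OLS with an intercept on the chosen subset; return residual SS.
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

selected, remaining = [], [0, 1, 2]
while remaining:
    # Greedily add the predictor whose inclusion most reduces the RSS.
    best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best)
    remaining.remove(best)

print(selected)   # order in which predictors were added
```

The two informative predictors are picked up before the noise predictor, illustrating why the incremental ordering is useful before cross-validation picks the final model size.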
4. SHRINKAGE:
Here, a model is fitted with all p predictors, but the estimated coefficients
are shrunk towards zero relative to the least-squares estimates.
This shrinkage is also known as regularization,
and it can also perform variable selection.
Ridge Regression: It is similar
to least squares, but the coefficients are estimated differently:
estimates are sought that reduce the RSS subject to a shrinkage penalty.
Ridge regression projects the data onto a d-directional space,
then shrinks the coefficients of the low-variance components more as
compared to the higher-variance components.
Disadvantage: It includes all p predictors in the final model.
Lasso: It overcomes the above disadvantage of ridge regression
and forces some coefficients to exactly zero, provided s is small.
It therefore also performs variable selection.
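A minimal sketch of the contrast between the two shrinkage methods, in the special case of an orthonormal design where both have closed forms: ridge divides each least-squares coefficient by (1 + lambda), while the lasso soft-thresholds it, forcing small coefficients exactly to zero. The coefficient values and penalty below are hypothetical:

```python
# Hypothetical least-squares coefficient estimates and penalty strength.
ols = [3.0, -0.4, 1.5, 0.2]
lam = 0.5

# Ridge (orthonormal-design closed form): shrink every coefficient,
# but never exactly to zero -- all p predictors stay in the model.
ridge = [b / (1 + lam) for b in ols]

def soft_threshold(b, t):
    # Lasso (orthonormal-design closed form): shift towards zero,
    # and clamp anything within the threshold to exactly zero.
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0

lasso = [soft_threshold(b, lam) for b in ols]
print(ridge)
print(lasso)
```

The small coefficients survive ridge (merely shrunk) but are zeroed by the lasso, which is precisely the variable-selection property noted above.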
Applications of Statistical Data Mining
Healthcare: Best practices that reduce
costs while improving healthcare are searched for.
Approaches Used: Multi-dimensional databases, Machine Learning, Soft Computing, etc.
Market Basket Analysis: The premise is that
customers buying a certain group of products are likely to buy another set of
products.
Approaches Used: Differential Analysis, etc.
Education: Students' future learning
patterns and effective learning techniques are identified.
Approaches Used: Predictive
Analysis, Machine Learning, etc.
Manufacturing Engineering: The relationship
between architecture, customer, and product portfolio is established.
Fraud Detection: Money laundering and theft
are dealt with.
Approaches Used: Decision Algorithms, etc.
Financial Banking: The correlation between
business information and market prices is found.