1. It is the use of statistical techniques for data mining.

2. It involves building database interfaces into statistical software.

3. Statistics is used to detect structure in the data and to perform meaningful analysis.

4. From a large amount of data, one of several possible inferences can be drawn.

Techniques Used in Statistical Data Mining

1. LINEAR REGRESSION:

· It fits the best possible linear relationship between the independent and dependent variables, and uses that relationship to predict the target variable.

· This is done by minimizing the distance (and hence the error) between each actual observation and the fitted line.

a. Simple Linear Regression: uses one independent variable to predict the dependent variable.

b. Multiple Linear Regression: uses more than one independent variable to predict the dependent variable.
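As a minimal sketch, both variants reduce to solving a least-squares problem. Here multiple linear regression is fitted with numpy's `lstsq` on made-up toy data (the coefficients 1, 2, 3 are simply the values the toy response was built from, so the fit recovers them exactly):

```python
import numpy as np

# Toy data: the response is exactly y = 1 + 2*x1 + 3*x2.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# Multiple linear regression: prepend an intercept column and solve
# the least-squares problem min ||X1 @ b - y||^2.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(coef)  # intercept and the two slopes: ~[1, 2, 3]
```

Simple linear regression is the same computation with a single predictor column.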

2. CLASSIFICATION:

· It classifies data into categories for easier analysis.

· Decision trees are one of its best-known techniques.

· It is used for large data sets.

· Logistic Regression:

a. It is used when the dependent variable is binary.

b. It falls under predictive analysis.

c. It explains the relationship between the binary dependent variable and one or more ordinal or continuous independent variables.

· Discriminant Analysis:

a. Observations are classified, on the basis of some measured characteristics, into one or more known populations (a priori classes).

b. The predictors are modelled within each response class, and the probability that an observation belongs to a class is calculated via Bayes' Theorem.

c. Linear or quadratic models can be used for this.
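As a rough illustration of the logistic-regression case, the sketch below fits a binary classifier from scratch by gradient descent on the log-loss; the toy data set, learning rate, and step count are all illustrative choices, not prescribed values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Fit binary logistic regression by gradient descent on the mean log-loss."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    w = np.zeros(X1.shape[1])
    for _ in range(steps):
        p = sigmoid(X1 @ w)                     # predicted class-1 probabilities
        w -= lr * X1.T @ (p - y) / len(y)       # gradient of the mean log-loss
    return w

# Toy binary problem: class 1 whenever x > 2.5.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(X, y)
preds = (sigmoid(np.column_stack([np.ones(len(X)), X]) @ w) > 0.5).astype(float)
print(preds)  # the fitted model separates the two classes
```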

3. RESAMPLING METHODS:

· In resampling, repeated samples are drawn from the original data set.

· It is a non-parametric method of statistical inference, i.e. it avoids general distribution tables.

· Resampling generates a unique sampling distribution on the basis of the actual data, using experimental rather than analytical methods.
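The bootstrap is the classic example of this idea: repeated samples are drawn, with replacement, from the observed data, and the quantity of interest is recomputed on each. A minimal sketch, with an arbitrary toy sample and 5000 resamples chosen purely for illustration:

```python
import random
import statistics

def bootstrap_means(data, n_resamples=5000, seed=0):
    """Draw repeated samples (with replacement) from the original data
    and record the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_resamples)]

data = [4, 8, 6, 5, 3, 9, 7, 5, 6, 4]
means = bootstrap_means(data)

# An empirical 90% interval for the mean, read directly off the
# resampling distribution instead of a general distribution table.
means.sort()
lo, hi = means[int(0.05 * len(means))], means[int(0.95 * len(means))]
print(lo, hi)
```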

4. SUBSET SELECTION:

· This approach uses only the subset of predictors believed to be related to the response.

· A model is then fitted via least squares on that subset.

a. Best Subset Selection: an OLS regression is fitted for every possible combination of the p predictors.

Algorithm:

i. Stage 1: for each model size k, up to the maximum, all models containing exactly k predictors are fitted.

ii. Stage 2: a single model is selected via cross-validated prediction error.

b. Forward Stepwise Selection: the model initially contains no predictors; predictors are then added one at a time until all are exhausted. At each step, the variable whose addition gives the greatest improvement, judged via cross-validated prediction error, is added.

c. Backward Stepwise Selection: initially, all p predictors are present in the model, after which the least useful predictors are removed one by one.

d. Hybrid Methods: similar to forward stepwise selection, except that after every new variable is added, variables that no longer contribute to the model may be removed.
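Forward stepwise selection can be sketched in a few lines. For brevity this version greedily minimizes the training RSS at each step (a stand-in for the cross-validated prediction error described above), and the toy data are constructed so that only one predictor actually matters:

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return float(r @ r)

def forward_stepwise(X, y):
    """Start with no predictors; at each step add the predictor that
    most reduces the RSS, until all predictors are used."""
    remaining, chosen, path = list(range(X.shape[1])), [], []
    while remaining:
        best = min(remaining, key=lambda j: rss(X, y, chosen + [j]))
        remaining.remove(best)
        chosen.append(best)
        path.append((tuple(chosen), rss(X, y, chosen)))
    return path

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 3 * X[:, 2] + 0.1 * rng.normal(size=60)  # only predictor 2 matters
path = forward_stepwise(X, y)
print(path[0][0])  # the first predictor picked
```

Since predictor 2 carries essentially all the signal, it is selected first; the remaining steps add the noise predictors with only marginal RSS gains.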

5. SHRINKAGE:

· Here the model is fitted with all p predictors, but the estimated coefficients are shrunk towards zero relative to the least-squares estimates.

· This shrinkage, a.k.a. regularization, reduces variance.

· It can also perform variable selection.

· Techniques:

a. Ridge Regression:

i. It is the same as least squares, except that the coefficients are estimated differently.

ii. Coefficient estimates that reduce the RSS, subject to a penalty on coefficient size, are sought.

iii. It projects the data onto d directions and shrinks the coefficients of low-variance components more than those of high-variance components.

iv. Disadvantage: it includes all p predictors in the final model.

b. Lasso:

i. The lasso overcomes this disadvantage of ridge regression and forces some coefficients to exactly 0, provided the constraint s is small enough.

ii. It therefore also performs variable selection.
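The shrinkage effect of ridge regression is easy to see numerically. The sketch below uses the closed-form ridge solution on centered data; the penalty values and toy coefficients are illustrative only. Note that even a very large penalty shrinks the coefficients towards zero without making any of them exactly zero, which is the disadvantage the lasso addresses:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: least squares plus an L2 penalty on the coefficients.
    Closed form on centered data: b = (X'X + lam*I)^-1 X'y
    (centering leaves the intercept unpenalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([4.0, -2.0, 0.0]) + 0.1 * rng.normal(size=50)

b_ols = ridge(X, y, 0.0)     # lam = 0 recovers ordinary least squares
b_big = ridge(X, y, 1000.0)  # a large penalty shrinks the whole vector

# The ridge coefficient vector shrinks towards zero as lam grows,
# but no coefficient is set exactly to zero.
print(np.linalg.norm(b_ols), np.linalg.norm(b_big))
```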

Some Applications of Statistical Data Mining

1. Healthcare: best practices that reduce costs while improving care are searched for.

Approaches Used: multi-dimensional databases, machine learning, soft computing, etc.

2. Market Basket Analysis: the premise is that customers who buy a certain group of products are likely to buy another set of products.

Approaches Used: differential analysis, etc.

3. Education: students' future learning patterns and effective learning techniques are identified.

Approaches Used: predictive analysis, machine learning, etc.

4. Manufacturing Engineering: relationships between architecture, customers, and the product portfolio are established.

Approaches Used: predictive analysis, etc.

5. Fraud Detection: money laundering and theft are dealt with.

Approaches Used: decision algorithms, etc.

6. Financial Banking: correlations between business information and market prices are found.

Approaches Used: clustering, etc.