Statistical Data Mining

1.      It is the use of statistical techniques for data mining.

2.      It involves constructing database interfaces to statistical software.


3.      Statistics is used to detect the structure of the data and perform meaningful analysis.

4.      On the basis of a large amount of data, one or a few inferences can be drawn.

Top Techniques Used in Statistical Data Mining

1.      LINEAR REGRESSION:

·        It fits the best possible linear relationship between the independent and dependent variables and hence predicts the target variable.

·        This is done by minimizing the distance (and hence the error) between each actual observation and the fitted line.

a.      Simple Linear Regression: It uses one independent variable to predict the dependent variable.

b.      Multiple Linear Regression: It uses more than one independent variable to predict the dependent variable (a sketch of both follows below).
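
As a concrete illustration, here is a minimal sketch of simple and multiple linear regression via ordinary least squares, using only numpy on synthetic data (the coefficients, sizes, and seed are illustrative assumptions, not from the source):

```python
# Minimal OLS sketch: simple (one predictor) and multiple (several
# predictors) linear regression on synthetic data.
import numpy as np

rng = np.random.default_rng(0)

# Simple linear regression: one independent variable x.
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(0, 1, size=100)   # true slope 2.5, intercept 1.0

X = np.column_stack([np.ones_like(x), x])        # prepend an intercept column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # minimizes the squared error
print("simple:   intercept=%.2f  slope=%.2f" % (coef[0], coef[1]))

# Multiple linear regression: more than one independent variable.
X2 = rng.uniform(0, 10, size=(100, 3))
y2 = X2 @ np.array([1.5, -0.7, 3.0]) + rng.normal(0, 1, size=100)

X2a = np.column_stack([np.ones(len(X2)), X2])
coef2, *_ = np.linalg.lstsq(X2a, y2, rcond=None)
print("multiple:", np.round(coef2, 2))           # intercept ~0, then the slopes
```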

 

2.       CLASSIFICATION:

·        It classifies data into categories for easier analysis.

·        It is commonly implemented with decision trees.

·        It is well suited to large data sets.

·        Logistic Regression:

a.       It is used when the dependent variable is binary.

b.       It falls under predictive analysis.

c.       It explains the relationship between the binary dependent variable and one or more independent variables (a sketch follows after this list).

 

·        Discriminant Analysis:

a.      Observations are classified, on the basis of some measured characteristics, into one or more known populations (a priori classes).

b.      The predictors are modelled within each response class, and the probability of class membership is calculated via Bayes' Theorem.

c.       Either linear or quadratic models can be used for this.
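
As a concrete illustration, here is a minimal sketch of both classification techniques on synthetic two-class data, assuming scikit-learn is available (the data, seed, and test point are illustrative assumptions):

```python
# Classification sketch: logistic regression for a binary target, plus
# linear and quadratic discriminant analysis, whose class posteriors are
# computed via Bayes' Theorem internally.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1.0, (100, 2)),     # population 0
               rng.normal(2, 1.5, (100, 2))])    # population 1
y = np.array([0] * 100 + [1] * 100)              # binary dependent variable

for model in (LogisticRegression(),
              LinearDiscriminantAnalysis(),
              QuadraticDiscriminantAnalysis()):
    model.fit(X, y)
    # Posterior probability of population 1 for one new observation.
    p = model.predict_proba([[1.0, 1.0]])[0, 1]
    print(f"{type(model).__name__}: P(class=1) = {p:.2f}")
```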

 

3.       RESAMPLING METHODS:

·        In resampling, repeated samples are drawn from the original data.

·        It is a non-parametric method of statistical inference, i.e. general distribution tables are avoided.

·        Resampling generates a unique sampling distribution on the basis of the actual data, using experimental rather than analytical methods (a bootstrap sketch follows below).
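
Here is a minimal sketch of the bootstrap, one common resampling method, using only numpy on synthetic data (the sample size, statistic, and seed are illustrative assumptions):

```python
# Bootstrap sketch: repeatedly resample the observed data with replacement
# to build an empirical sampling distribution of a statistic (the mean),
# with no recourse to a general distribution table.
import numpy as np

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=50)       # the original sample

boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
])

# A percentile confidence interval read straight off the sampling
# distribution generated from the actual data.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {data.mean():.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```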

 

 

4.       SUBSET SELECTION:

·        This approach uses a subset of the predictors believed to bear a relation to the response.

·        A model is then fitted via least squares on that subset.

a.      Best Subset Selection: For every possible combination of the p predictors, an OLS regression is fitted.

Algorithm:

                     i.           Stage 1: For each k = 1, …, p, all models containing exactly k predictors are fitted.

                    ii.           Stage 2: A single model is selected via cross-validated prediction error.

 

b.      Forward Stepwise Selection: The model initially contains no predictors; predictors are then added one at a time until all are exhausted. At each step, the variable giving the greatest improvement, judged by cross-validated prediction error, is added.

c.      Backward Stepwise Selection: Initially, all p predictors are present in the model, after which the least useful predictor is removed one at a time.

d.      Hybrid Methods: Similar to forward stepwise selection, except that after every new variable is added, variables no longer contributing to the model may be removed (a sketch of best subset and forward stepwise selection follows below).
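
Here is a minimal sketch of best subset and forward stepwise selection in pure numpy on synthetic data, using RSS as the selection criterion (cross-validated prediction error could be substituted; the data, sizes, and seed are illustrative assumptions):

```python
# Subset selection sketch: best subset enumerates every combination of
# predictors; forward stepwise adds one predictor at a time greedily.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 5))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 1, 150)  # only cols 0, 3 matter

def rss(cols):
    """OLS fit on the chosen columns (plus intercept); return residual SS."""
    A = np.column_stack([np.ones(len(X))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

# Best subset selection, stage 1: the best model of each size k.
for k in range(1, X.shape[1] + 1):
    best = min(combinations(range(X.shape[1]), k), key=rss)
    print(f"best {k}-predictor model: {best}, RSS = {rss(best):.1f}")

# Forward stepwise selection: greedy order of entry.
selected, remaining = [], list(range(X.shape[1]))
while remaining:
    nxt = min(remaining, key=lambda c: rss(selected + [c]))
    selected.append(nxt)
    remaining.remove(nxt)
print("stepwise order of entry:", selected)  # columns 0 and 3 enter first
```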

5.       SHRINKAGE:

·        Here, a model with all p predictors is fitted, but the estimated coefficients are shrunk toward zero relative to the least-squares estimates.

·        This shrinkage, a.k.a. regularization, reduces variance.

·        It can also perform variable selection.

·        Techniques:

a.      Ridge Regression:

                     i.           It is the same as least squares, except that the coefficients are estimated differently: coefficient estimates that reduce the RSS are sought, subject to a shrinkage penalty.

                    ii.           It projects the data onto d directions and shrinks the coefficients of the low-variance components more than those of the high-variance components.

                   iii.           Disadvantage: it includes all p predictors in the final model.

b.       Lasso:

                     i.           Lasso overcomes this disadvantage of ridge regression by forcing some coefficients exactly to 0, provided the tuning parameter s is small enough.

                    ii.           It therefore also performs variable selection (a sketch contrasting the two follows below).
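
Here is a minimal sketch contrasting ridge and lasso shrinkage, assuming scikit-learn is available (the penalty strengths, data, and seed are illustrative assumptions):

```python
# Shrinkage sketch: ridge shrinks every coefficient but keeps all p
# predictors in the final model; lasso can force some coefficients
# exactly to zero, performing variable selection as a side effect.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, 200)  # 6 irrelevant predictors

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)
print("ridge:", np.round(ridge.coef_, 2))  # every coefficient stays nonzero
print("lasso:", np.round(lasso.coef_, 2))  # irrelevant ones driven to 0.0
```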

Some Applications of Statistical Data Mining

1.      Healthcare: Best practices that reduce costs while improving healthcare are identified.

Approaches Used: Multi-dimensional databases, Machine Learning, Soft Computing, etc.

 

2.      Market Basket Analysis: The premise is that customers buying a certain group of products are likely to buy another set of products.

Approaches Used: Differential Analysis, etc.

 

3.      Education: Students' future learning patterns and effective learning techniques are identified.

Approaches Used: Predictive Analysis, Machine Learning, etc.

 

4.      Manufacturing Engineering: Relationships between architecture, customers, and the product portfolio are established.

Approaches Used: Predictive Analysis, etc.

 

5.      Fraud Detection: Money laundering and theft are dealt with.

Approaches Used: Decision Algorithms, etc.

 

6.      Financial Banking: Correlations between business information and market prices are found.

Approaches Used: Clustering, etc.