Statistical Learning

Suppose that we observe a quantitative response $Y$ and $p$ different predictors, $X_1, X_2, \ldots, X_p$. We assume that there is some relationship between $Y$ and $X = (X_1, X_2, \ldots, X_p)$, which can be written in the very general form

$$Y = f(X) + \varepsilon$$

Here $f$ is some fixed but unknown function of $X_1, \ldots, X_p$, and $\varepsilon$ is a random error term, which is independent of $X$ and has mean zero.

Why do we need to estimate $f$?

Prediction

In many situations, a set of inputs $X$ are readily available, but the output $Y$ cannot be easily obtained. In this setting, since the error term averages to zero, we can predict $Y$ using

$$\hat{Y} = \hat{f}(X)$$

where $\hat{f}$ represents our estimate for $f$, and $\hat{Y}$ represents the resulting prediction for $Y$.

The accuracy of $\hat{Y}$ as a prediction for $Y$ depends on two quantities, which we will call the reducible error and the irreducible error. In general, $\hat{f}$ will not be a perfect estimate for $f$, and this inaccuracy will introduce some error. This error is reducible because we can potentially improve the accuracy of $\hat{f}$ by using the most appropriate statistical learning technique to estimate $f$.

Since $Y$ is also a function of $\varepsilon$, even if we estimate $f$ perfectly we cannot reduce this part of the error; it is the irreducible error.

Why is there an irreducible error?

This is because $\varepsilon$ may contain unmeasured variables that are useful in predicting $Y$, as well as inherently unmeasurable variation.

$$E(Y - \hat{Y})^2 = E\big[f(X) + \varepsilon - \hat{f}(X)\big]^2 = \underbrace{\big[f(X) - \hat{f}(X)\big]^2}_{\text{reducible}} + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible}}$$

where $X$ and $\hat{f}$ are treated as fixed, and the cross term $2\big[f(X) - \hat{f}(X)\big]E[\varepsilon]$ vanishes because $\varepsilon$ has mean zero.
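
To make the two error components concrete, here is a minimal simulation sketch (not from the text; the true function `f_true`, the noise level, and the deliberately crude estimate are arbitrary choices): even a perfect estimate of $f$ is still left with roughly $\operatorname{Var}(\varepsilon)$ of error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: a known true f and Gaussian noise with Var(eps) = 0.25.
def f_true(x):
    return np.sin(2 * x) + 0.5 * x

n = 10_000
x = rng.uniform(0, 3, n)
y = f_true(x) + rng.normal(0, 0.5, n)

f_hat_crude = 0.5 * x        # a poor estimate: ignores the sin term (reducible error)
f_hat_perfect = f_true(x)    # pretend we recovered f exactly

print("MSE, crude f_hat:  ", np.mean((y - f_hat_crude) ** 2))
print("MSE, perfect f_hat:", np.mean((y - f_hat_perfect) ** 2))   # ~ Var(eps) = 0.25
```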

Inference

One may be interested in the following questions:

  1. Which predictors are associated with the response?
  2. What is the relationship between predictor and response?
  3. Can the relationship be summarized as linear or is it more complicated?

How to estimate $f$?

Parametric methods

We make an assumption about the functional form of $f$, for example the linear form

$$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$

Then finding $f$ boils down to finding the parameters $\beta_0, \beta_1, \ldots, \beta_p$.

The potential disadvantage of parametric methods is that if the true $f$ is far from the assumed form, our estimate will be poor. This can be mitigated by using more flexible models, but greater flexibility can lead to overfitting.
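
A minimal sketch of the parametric approach, assuming simulated data with $p = 2$ predictors and made-up coefficients: once the linear form is assumed, estimating $f$ reduces to estimating the $\beta$'s by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data generated from a linear model (coefficients chosen for illustration).
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -0.5])                 # [beta_0, beta_1, beta_2]
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.3, n)

# Fit the assumed form f(X) = beta_0 + beta_1 X_1 + beta_2 X_2 by least squares.
X_design = np.column_stack([np.ones(n), X])            # prepend an intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("estimated coefficients:", beta_hat)             # should be close to beta_true
```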

Non-parametric methods

These methods estimate $f$ locally, using nearby data points, without assuming a fixed functional form for $f$. Avoiding that assumption is their advantage over parametric methods.

The disadvantage is that a large number of observations is required to obtain an accurate estimate of $f$, as in the sketch below.
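
As an illustrative sketch (the one-dimensional data and the choice $k = 10$ are assumptions, not from the text), a $k$-nearest-neighbours average is one simple non-parametric estimate of $f$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical training data from an unknown, non-linear f.
x_train = rng.uniform(0, 3, 300)
y_train = np.sin(2 * x_train) + rng.normal(0, 0.3, 300)

def knn_predict(x0, k=10):
    """Estimate f(x0) as the average response of the k nearest training points."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

print(knn_predict(1.5))     # compare with the true value sin(3) ≈ 0.141
```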

Supervised vs Unsupervised learning

Supervised learning is where you have input variables $X$ and an output variable $Y$, and you use an algorithm to learn the mapping function from the input to the output.

Ex: Regression, Classification

Unsupervised learning is where you only have input data $X$ and no corresponding output variables.

Ex: Clustering
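
A small sketch of the distinction (simulated data; the two models are just convenient examples from scikit-learn): supervised methods use an observed response $y$, while unsupervised methods see only $X$.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))

# Supervised: a response y is observed, and we learn the mapping X -> y.
y = 3 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, 100)
reg = LinearRegression().fit(X, y)
print("learned coefficients:", reg.coef_)

# Unsupervised: only X is observed; we look for structure (here, 2 clusters).
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```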

Assessing model accuracy

Measuring the quality of fit

Mean squared error: $$MSE = \frac{1}{n}\sum_{i=1}^n \big(y_i - \hat{f}(x_i)\big)^2$$

In the accompanying figure (not reproduced here), the red curve in the right-hand panel is the test MSE and the other curve is the training MSE.

As flexibility increases beyond a certain point, the training MSE keeps decreasing while the test MSE starts to increase; this is called overfitting.
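
A minimal sketch of this pattern (simulated data; the true function, noise level, sample sizes, and polynomial degrees are arbitrary choices), using polynomial degree as the flexibility knob:

```python
import numpy as np

rng = np.random.default_rng(4)

def f_true(x):
    return np.sin(3 * x)

def simulate(n):
    x = rng.uniform(-1, 1, n)
    return x, f_true(x) + rng.normal(0, 0.3, n)

x_train, y_train = simulate(30)
x_test, y_test = simulate(1000)

# Higher degree = more flexible fit; training MSE keeps falling, test MSE eventually rises.
for degree in (1, 3, 12):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```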

Bias-Variance tradeoff

Given a test observation $(x_0, y_0)$:

$$
\begin{aligned}
E\big(y_0 - \hat{f}(x_0)\big)^2
&= E\big[y_0^2 + \hat{f}(x_0)^2 - 2\,y_0\,\hat{f}(x_0)\big] \\
&= E[y_0^2] + E[\hat{f}(x_0)^2] - 2\,E[y_0\,\hat{f}(x_0)] \\
&= \operatorname{Var}(y_0) + E[y_0]^2 + \operatorname{Var}(\hat{f}(x_0)) + E[\hat{f}(x_0)]^2 - 2\,f(x_0)\,E[\hat{f}(x_0)] \\
&= \operatorname{Var}(\hat{f}(x_0)) + \big[\operatorname{Bias}(\hat{f}(x_0))\big]^2 + \operatorname{Var}(\varepsilon)
\end{aligned}
$$

using $E[y_0] = f(x_0)$, $\operatorname{Var}(y_0) = \operatorname{Var}(\varepsilon)$, and the independence of $\varepsilon$ and $\hat{f}(x_0)$, so that $E[y_0\,\hat{f}(x_0)] = f(x_0)\,E[\hat{f}(x_0)]$.

From the above equation, in order to minimize the expected test error we need to achieve both low bias and low variance simultaneously.

Variance

Signifies the amount by which $\hat{f}$ would change if we estimated it using a different training dataset.

In general, more flexible statistical learning methods have higher variance.

Bias

Signifies the error that is introduced by approximating a real-life problem, which may be very complicated, by a much simpler model.

No matter how well we estimate a truly non-linear relationship with a linear function, the fit will be inaccurate; this implies that linear regression will have high bias in such cases.

In general, more flexible methods have lower bias, as the sketch below illustrates.
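
A rough Monte Carlo sketch of these two behaviours (the true function, noise level, and the two flexibility levels are arbitrary choices): refitting on many simulated training sets and looking at $\hat{f}(x_0)$ shows the inflexible fit with high bias and low variance, and the flexible fit with the reverse.

```python
import numpy as np

rng = np.random.default_rng(5)

def f_true(x):
    return np.sin(3 * x)

x0, sigma, n, reps = 0.5, 0.3, 50, 500

for degree in (1, 10):                      # inflexible vs flexible polynomial fit
    preds = np.empty(reps)
    for r in range(reps):
        # Draw a fresh training set, refit, and predict at the test point x0.
        x = rng.uniform(-1, 1, n)
        y = f_true(x) + rng.normal(0, sigma, n)
        coefs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coefs, x0)
    bias_sq = (preds.mean() - f_true(x0)) ** 2
    variance = preds.var()
    print(f"degree {degree:2d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```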
