machine learning intro
21 May 2016

These are notes from the third lesson of the Coursera big data specialization.
There are four categories of machine learning:
- supervised learning: feedback supplied explicitly
- reinforcement learning: feedback supplied by the environment (e.g. control theory, decision theory)
- game theory: feedback supplied by other actors in the system
- unsupervised learning: no feedback supplied
Learning is built from three core components: Representation, Evaluation, and Optimization.
- Representation : what is your classifier? decision tree, neural network, hyperplane
- Evaluation : how do we know if a given classifier is good or bad? precision and recall, squared error, likelihood
- Optimization : how do you search among all the alternatives? greedy search? gradient descent?
Supervised Learning
Supervised learning is when we want to learn the relationship between inputs and outputs, given examples of inputs and associated outputs. We also use different terminology depending on the attribute to learn:
- classification if the attribute is categorical (“nominal”)
- regression if the attribute is numeric
Training
The first step is to decide which data to train on. In supervised learning, we have some input data with the associated result. The first thing to do is to split these data into a training set and a test set. To do so, we can use one of the following (a short sketch of the first two follows the list):
- split with a fixed % (70%-30% or 80%-20%)
- k-fold cross-validation (split the data into k folds without replacement, train on k-1 folds and test on the remaining fold, then repeat for each fold)
- leave-one-out cross-validation (special case of the previous method where each fold contains a single point, i.e. k equals the number of samples)
- bootstrap (generate new training sets by sampling with replacement)
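As a minimal sketch of the first two options, assuming scikit-learn and a generic feature matrix `X` with labels `y` (both placeholders here):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Toy data standing in for any feature matrix X and label vector y.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Fixed-percentage split (here 70% train / 30% test).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# k-fold cross-validation: train on k-1 folds, test on the held-out fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    # ...fit a model on (X_tr, y_tr) and evaluate it on (X_te, y_te)...
```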
Here are some useful methods to train your algorithm:
bootstrap
bootstrap works as follows (a minimal sketch follows the list):
- Given a dataset of size N
- Draw N samples with replacement to create a new dataset
- Repeat ~1000 times
- compute ~1000 sample statistics and interpret these as repeated experiments
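As a minimal numpy sketch, where the dataset and the statistic (here the mean) are just placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=50)   # stand-in dataset of size N

n_resamples = 1000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Draw N samples with replacement and compute the statistic of interest.
    sample = rng.choice(data, size=data.size, replace=True)
    boot_means[i] = sample.mean()

# Interpret the ~1000 statistics as repeated experiments,
# e.g. to get a 95% confidence interval for the mean:
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% CI for the mean: [{low:.2f}, {high:.2f}]")
```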
But keep in mind that:
- bootstrap may underestimate the confidence interval for small samples
- bootstrap cannot be used to estimate the min or max of a population
- samples need to be independent
bagging
bagging consists of training many weak models and using them together to get a strong model.
- draw N bootstrap samples
- retrain the model on each sample
- combine the results (average for regression, majority vote for classification)

It works great for overfit models (it decreases variance without changing bias) but it doesn't help much with underfit/high-bias models.
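A hedged scikit-learn sketch (`BaggingClassifier` does the resample/train/combine loop; the base estimator and the data are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; any (X, y) classification problem works.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: many bootstrap samples, one deep (overfit-prone) tree per sample,
# predictions combined by majority vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print("bagged accuracy:", bag.score(X_test, y_test))
```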
Evaluation in Supervised Learning
Here, we will present different methods and useful terms to evaluate our representation. Note that most of them can be applied to other forms of learning!
accuracy
For example, let's take the Titanic dataset and a representation such as: IF sex='female' THEN survive=yes ELSE survive=no. It gives us a confusion matrix (a square array where the diagonal counts the correct predictions):
|            | Predict No | Predict Yes |
|------------|------------|-------------|
| True (No)  | 468        | 109         |
| True (Yes) | 81         | 233         |
It gives us an accuracy of $\frac{468+233}{468+233+81+109} \approx 79\%$ correct.
Note that accuracy isn't always enough. You can't interpret 90% accuracy on its own; it depends on the problem. We need a baseline to compare against. For future reference, the base rate is the accuracy of trivially predicting the most frequent class, and the naive rate is the accuracy of some simple default or pre-existing model.
Let’s take the following confusion matrix
|        | Predicted + | Predicted - |
|--------|-------------|-------------|
| True + | a           | b           |
| True - | c           | d           |
Some useful terms (computed in the sketch after this list) are:
- lift = $\frac{a/(a+b)}{(a+c)/(a+b+c+d)}$
- precision = $\frac{a}{a+c}$
- recall = sensitivity = $\frac{a}{a+b}$
- 1-specificity = $1 - \frac{d}{c+d} = \frac{c}{c+d}$
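A small sketch computing these quantities from the a/b/c/d counts above, using the Titanic matrix (with "survived" as the positive class) as an example:

```python
def confusion_metrics(a, b, c, d):
    """a = true+/pred+, b = true+/pred-, c = true-/pred+, d = true-/pred-."""
    total = a + b + c + d
    return {
        "accuracy": (a + d) / total,
        "precision": a / (a + c),
        "recall": a / (a + b),
        "false_positive_rate": c / (c + d),        # 1 - specificity
        "lift": (a / (a + b)) / ((a + c) / total),
    }

# Titanic rule "survive if female", with "survived" as the positive class:
# a = 233, b = 81, c = 109, d = 468.
print(confusion_metrics(a=233, b=81, c=109, d=468))
```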
Entropy
In order to decide whether a node in the decision tree is helpful or not, we need the notion of entropy: $H(X) = E(I(X)) = \sum_x p_x I(x)$
For example, consider two sequences of coin flips, THHTHTTHTH and TTHTHTHHTH. How much information do we get after flipping each coin once? We want some function "information" that satisfies:
$\text{information}_{1,2}(p_1 p_2) = \text{information}_1(p_1) + \text{information}_2(p_2)$, so it's $I(x) = -\log_2 p_x$
The expected information is entropy: $H(X) = E(I(X)) = \sum_{x} p_x I(x) = -\sum_x p_x \log_2 p_x$. So for flipping a coin, entropy $= -(0.5\log_2 0.5 + 0.5\log_2 0.5) = 1$: we learned 'one' bit of information.
Now, let's consider rolling a die: $p_1 = p_2 = \dots = p_6 = \frac{1}{6}$. Entropy $= -6 \times (\frac{1}{6} \times \log_2 \frac{1}{6}) \approx 2.58$, so after rolling a die I gain more information than after flipping a coin: it's more unpredictable. Now what happens if we have a weighted die with $p_1 = p_2 = p_3 = p_4 = p_5 = 0.1$ and $p_6 = 0.5$? Entropy $= -5 \times (0.1 \log_2 0.1) - 0.5 \log_2 0.5 \approx 2.16$. So here we get less information after each try: it's less unpredictable.
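A quick check of these numbers in Python:

```python
import math

def entropy(probs):
    """H(X) = -sum_x p_x * log2(p_x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))          # fair coin    -> 1.0 bit
print(entropy([1 / 6] * 6))         # fair die     -> ~2.58 bits
print(entropy([0.1] * 5 + [0.5]))   # weighted die -> ~2.16 bits
```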
gini coefficient
Entropy captures an intuition of "impurity": we want to choose attributes that split records into pure classes. The Gini coefficient measures this inequality: $Gini(T) = 1 - \sum\limits_{i=1}^{n} p_i^2$
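As a companion to the entropy function above, a one-line sketch:

```python
def gini(probs):
    """Gini(T) = 1 - sum_i p_i^2: 0 for a pure node, larger when classes are mixed."""
    return 1 - sum(p * p for p in probs)

print(gini([1.0, 0.0]))   # pure node   -> 0.0
print(gini([0.5, 0.5]))   # 50/50 split -> 0.5
```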
overfitting
Overfitting is when your learner outputs a classifier that is more accurate on the training data than on the test data. Underfitting often has high bias and low variance, whereas overfitting has low bias and high variance. When you have some data to model your problem, split it into training and test data; the difference between the fit on the training data and on the test data measures the model's ability to generalize.
Representation in Supervised Learning
Let’s now define some representation algorithms.
decision tree
Back to the Titanic example, we can make a decision tree such as:
IF pclass='1' THEN
IF sex='female' THEN survive=yes
IF sex='male' AND age < 5 THEN survive=yes
...
We usually choose the attribute with the highest information gain first (the biggest reduction in entropy compared to the parent node, i.e. the split that leaves the lowest entropy).
Another parameter we need to pay attention to is the number of levels of our decision tree: we don't want overfitting (perfectly fitting the training data but not the test data). Decision trees are prone to overfitting.
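A hedged scikit-learn sketch; the Titanic-style rows below are made up for illustration, and `max_depth` is the "number of levels" knob that guards against overfitting:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up Titanic-like data: pclass, sex (0=male, 1=female), age -> survived.
df = pd.DataFrame({
    "pclass":   [1, 1, 2, 3, 3, 1, 2, 3],
    "sex":      [1, 0, 1, 0, 1, 0, 0, 1],
    "age":      [29, 40, 30, 4, 22, 60, 35, 18],
    "survived": [1, 0, 1, 1, 1, 0, 0, 0],
})
X, y = df[["pclass", "sex", "age"]], df["survived"]

# Splits are chosen by information gain (criterion="entropy");
# limiting the depth keeps the tree from memorizing the training data.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.predict(X.iloc[:2]))   # predictions for the first two passengers
```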
random forest
repeat k times:
- draw a bootstrap sample from the dataset
- train a decision tree
- until the tree is maximum size
- choose next leaf node
- select m attributes at random from the p available
- pick the best attribute/split as usual
- measure out-of-bag error
- evaluate against the samples that were not selected in the bootstrap
- provides measures of strength (inverse error rate), correlation between trees (which increases the forest error rate), and variable importance
Finally, make a prediction by majority vote among the k trees.
We talked about variable importance; the key idea is that if you scramble the values of a variable and the accuracy of your tree doesn't change much, then the variable isn't very important. Random forests are more difficult to interpret than single trees; understanding variable importance helps.
The method is easy to parallelize.
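A scikit-learn sketch of the same recipe: `RandomForestClassifier` handles the bootstrap samples, the random attribute subsets, and the out-of-bag evaluation (the data is a placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # k trees
    max_features="sqrt",   # m attributes picked at random at each split
    oob_score=True,        # evaluate on the samples left out of each bootstrap
    random_state=0,
)
forest.fit(X, y)

print("out-of-bag accuracy:", forest.oob_score_)
print("variable importances:", forest.feature_importances_)
```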
nearest neighbor
Nearest neighbor is really simple: plot your point in a space and choose the class of the nearest point to it. Note that this algorithm assumes only numeric attributes.
There is also k-nearest neighbors, which takes the class of the majority of the k nearest points. It gives more resilience to noise. For choosing k:
- small k -> fast
- large k -> bias towards popular labels, ignores outliers
In order to use this algorithm, we also need to define some similarity function. For that, we can use Euclidean distance or cosine similarity (a measure of the difference between angles). The disadvantage of Euclidean distance is that it is sensitive to the number of dimensions; the disadvantage of cosine similarity is that it favors the dominant components.
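A short scikit-learn sketch; the `metric` argument is where the Euclidean-vs-cosine choice comes in, and the data is a placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k controls the trade-off discussed above; metric can be "euclidean", "cosine", ...
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print("k-NN accuracy:", knn.score(X_test, y_test))
```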
optimization
gradient descent
The gradient descent algorithm works like this:
- express your learning problem in terms of a cost function that should be minimized
- starting at an initial point, step 'downhill' until you reach a minimum
- some situations offer a guarantee that the minimum is the global minimum; others don’t
For example, if you have a cost function $J(\theta^{(i)})$, the next coefficient is calculated as $\theta_0^{(i+1)} \leftarrow \theta_0^{(i)} - \alpha \frac{\partial}{\partial \theta_0}J(\theta^{(i)})$
$\alpha$ being the learning rate, usually quite small (ex: 0.001)
One problem with gradient descent is that it might end up in a local minimum.
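A bare-bones numpy sketch for a one-parameter least-squares cost (the data, the initial point, and the learning rate are placeholders):

```python
import numpy as np

# Toy data: y is roughly 3x, with a single parameter theta to learn.
x = np.linspace(0, 1, 50)
y = 3 * x + np.random.default_rng(0).normal(scale=0.1, size=x.size)

def cost(theta):
    return 0.5 * np.mean((theta * x - y) ** 2)

def gradient(theta):
    return np.mean((theta * x - y) * x)

theta, alpha = 0.0, 0.1               # initial point and learning rate
for _ in range(1000):
    theta -= alpha * gradient(theta)  # step 'downhill'

print(theta, cost(theta))             # theta should end up near 3
```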
cost functions
Some known cost functions are:
- linear regression $J(\theta) = \frac{1}{2n} \sum\limits_{i=1}^{n}(h_{\theta}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2} \theta^2 $
- Support Vector Machines (SVM) $J(\theta) = \frac{1}{n} \sum\limits_{i=1}^{n} \max(1-y_i(\theta \cdot x_i),0) + \frac{\lambda}{2} \theta^2$ with $\theta$ the vector of weights, $x_i$ the vector of instance data, and $\frac{\lambda}{2} \theta^2$ the regularization term (see the sketch after this list)
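For example, the SVM cost can be computed directly; the weights, data, and $\lambda$ below are placeholders, with labels in {-1, +1}:

```python
import numpy as np

def svm_cost(theta, X, y, lam):
    """Average hinge loss plus an L2 regularization term."""
    margins = np.maximum(1 - y * (X @ theta), 0)
    return margins.mean() + (lam / 2) * np.dot(theta, theta)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                    # instance data x_i
y = np.where(X[:, 0] + 0.1 * rng.normal(size=100) > 0, 1, -1)    # labels in {-1, +1}
theta = np.array([1.0, 0.0, 0.0])                                # some weight vector

print(svm_cost(theta, X, y, lam=0.1))
```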
regularization term
When we have a lot of weights, it is likely that many are correlated (for example, pixels in an image), so as one weight goes up, another goes down to compensate. This may lead to overfitting as the weights explode. A solution is to enforce some condition on the weights so as to prefer simple models; the regularization term provides this balance. It is often based on a norm, for example:
- lasso (L1 norm)
- ridge regression (L2 norm; both are sketched below)
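A scikit-learn sketch of the two; `alpha` plays the role of $\lambda$, and the data is a placeholder:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

# ridge regression: the L2 penalty shrinks correlated weights together.
ridge = Ridge(alpha=1.0).fit(X, y)

# lasso: the L1 penalty drives many weights exactly to zero (a sparser model).
lasso = Lasso(alpha=1.0).fit(X, y)

print("non-zero ridge weights:", (ridge.coef_ != 0).sum())
print("non-zero lasso weights:", (lasso.coef_ != 0).sum())
```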
Improved gradient descent
Optimized versions of gradient descent are:
- stochastic gradient descent : at each step, pick one random data point and continue as if your entire dataset were just that one point. Each step is much faster, but it requires more iterations to converge!
- minibatch gradient descent : at each step, pick a small subset of data points and continue as if your entire dataset were just that subset. It is still fast and more precise than the stochastic version! (both are sketched after this list)
- parallel stochastic gradient descent: in each of k threads, pick a random data point, compute the gradient, and update the weights; the weights from the different threads get mixed.
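A minimal numpy sketch of the first two variants, reusing the one-parameter least-squares cost from the gradient descent example above (step counts, learning rate, and batch size are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 500)
y = 3 * x + rng.normal(scale=0.1, size=x.size)

def grad(theta, xb, yb):
    # Gradient of 0.5 * mean((theta*x - y)^2) on a batch.
    return np.mean((theta * xb - yb) * xb)

# Stochastic gradient descent: one random point per step.
theta_sgd = 0.0
for _ in range(5000):
    i = rng.integers(x.size)
    theta_sgd -= 0.1 * grad(theta_sgd, x[i], y[i])

# Minibatch gradient descent: a small random subset per step.
theta_mb = 0.0
for _ in range(2000):
    idx = rng.integers(x.size, size=32)
    theta_mb -= 0.1 * grad(theta_mb, x[idx], y[idx])

print(theta_sgd, theta_mb)   # both should end up near 3
```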
unsupervised learning
Zoubin Ghahramani said in 2004:
almost all work in unsupervised learning can be viewed in terms of learning a probabilistic model of the data
clustering
- no precise definition of a cluster
- output is always a set of sets of items
- items may be points in some multi-dimensional space (find similar)
- items may be vertices in a graph (community detection in social networks)
k-means clustering
Create clusters by computing the distance between points and the cluster centers and assigning each point to the nearest cluster. Then move each cluster center to the center of the points belonging to it, and iterate. It can be parallelized using map-reduce (map: assign points to clusters; reduce: compute the new cluster locations).
Weaknesses of this algorithm are:
- you have to know the number of clusters
- there is sensitivity to initial starting point (no unique solution)
- there is some sensitivity to stopping threshold
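A scikit-learn sketch (the number of clusters must be supplied up front, as noted above; the blobs are placeholder data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_clusters has to be chosen by us; n_init re-runs the algorithm from several
# starting points to soften the sensitivity to initialization.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_)
print(labels[:10])
```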
DBSCAN
Given a multi-dimensional space, group points that are separated by a distance of no more than some epsilon. It has quite a few advantages: no need to assume a fixed number of clusters, no dependence on starting conditions, only two parameters (the distance threshold and the minimum number of neighbors), and it is amenable to spatial indexing techniques (O(n log n)). Disadvantages are that it is sensitive to the usual Euclidean distance measure problems, and variable density can be a problem.
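A scikit-learn sketch; `eps` is the distance threshold and `min_samples` the minimum-neighbors parameter, and the two-moons data is a placeholder:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: a shape k-means handles badly but DBSCAN handles well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Label -1 marks noise points; the other labels are cluster ids.
print("clusters found:", len(set(labels) - {-1}))
```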
See the scikit-learn clustering documentation for the strengths and weaknesses of these algorithms.