Monday, 10 August 2020

This week 3/2020 - Machine Learning

This article is a summary of topics I encountered during the Machine Learning course on Coursera by Andrew Ng.


1. Supervised Learning - a family of algorithms whose learning process uses a training set containing inputs and their correct outputs.

The hypothesis function:
$$ h_{\theta}(x) = {\theta}_0 + {\theta}_1x $$
The cost function with a regularization term:

$$J({\theta}_0,{\theta}_1) = \frac{1}{2m} \sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
The goal is to minimize the cost function.
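
A minimal NumPy sketch of this cost computation (function and variable names such as `linear_cost`, `X`, `y`, `lam` are my own, not from the course):

```python
import numpy as np

def linear_cost(theta, X, y, lam):
    """Regularized linear regression cost J(theta).

    X     : (m, n+1) design matrix with a leading column of ones
    y     : (m,) target vector
    theta : (n+1,) parameter vector
    lam   : regularization strength lambda (theta_0 is not regularized)
    """
    m = len(y)
    errors = X @ theta - y                        # h_theta(x) - y for every example
    cost = (errors @ errors) / (2 * m)            # 1/(2m) * sum of squared errors
    reg = lam / (2 * m) * np.sum(theta[1:] ** 2)  # skip theta_0 in the penalty
    return cost + reg
```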

1.1 Linear regression - the algorithm fits the coefficients of a linear equation to approximate the training data and obtain the lowest cost.

 

In the course, two methods to achieve that were presented:
1.1.1 Gradient descent - an iterative approach - in each iteration the cost function should get closer to a local minimum. The main requirements and characteristics:
- requires choosing a learning rate alpha - if too large, the cost may increase; if too small, many steps are needed to reach the minimum of the cost function,
- needs many iterations,
- recommended for a large number of features,


In a loop, new values are computed (with the regularization term): $${\theta} := {\theta} - {\alpha}\left(\frac{1}{m} X^T(X{\theta}-y) + \frac{\lambda}{m}\begin{bmatrix} 0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}\right) $$
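
A minimal NumPy sketch of this update loop (names like `gradient_descent`, `alpha`, `lam` are my own, not from the course):

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, lam, iterations):
    """Iteratively apply the regularized update rule shown above."""
    m = len(y)
    for _ in range(iterations):
        grad = X.T @ (X @ theta - y) / m   # unregularized gradient term
        reg = (lam / m) * theta
        reg[0] = 0.0                       # theta_0 is not regularized
        theta = theta - alpha * (grad + reg)
    return theta
```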

1.1.2 Normal equation - a non-iterative way to find θ. The main features of this algorithm:
- no alpha (learning rate) needed,
- no iterations needed to find the minimum of the cost function,
- requires calculating (XᵀX)⁻¹, which has complexity O(n³), so it is slow for a large number of features,
- may run into problems with matrix inversion (non-invertibility requires additional steps: removing redundant features or using regularization).

$${\theta} = \left(X^TX + \lambda\begin{bmatrix} 0 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \right)^{-1} X^Ty $$
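
A short sketch of the closed-form solution above (my own names; `pinv` is used in case XᵀX is close to singular):

```python
import numpy as np

def normal_equation(X, y, lam):
    """Regularized normal equation for linear regression."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0  # do not regularize the bias term
    return np.linalg.pinv(X.T @ X + lam * L) @ X.T @ y
```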



1.2 Logistic regression - a classification algorithm that gives a binary output.

 

For more than 2 classes (n classes), a one-vs-all scheme with n classifiers is used, and the class whose classifier outputs the highest probability is chosen. Issues encountered:
- choosing the correct decision boundary,
- additional optimization algorithms (conjugate gradient, BFGS, L-BFGS) - usually faster than gradient descent but more complex.


The hypothesis function (sigmoid):
$$ h_{\theta}(x) = \frac{1}{1+e^{-{\theta}^Tx}}$$
which is interpreted as the probability
$$ P(y=1|x;{\theta})$$
The cost function with a regularization term:

$$J({\theta}) = -\frac{1}{m}\left[y^T\log\left(\frac{1}{1+e^{-X{\theta}}}\right)+(1-y)^T\log\left(1-\frac{1}{1+e^{-X{\theta}}}\right)\right] + \frac{\lambda}{2m} \sum_{j=1}^{n}\theta_j^2 $$

The goal is to minimize the cost function. For multi-class classification, the class whose hypothesis h gives the highest value is chosen.
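
A vectorized NumPy sketch of this cost (names such as `logistic_cost` and `lam` are my own, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Regularized logistic regression cost matching the formula above.

    X : (m, n+1) design matrix, y : (m,) vector of 0/1 labels.
    """
    m = len(y)
    h = sigmoid(X @ theta)
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    reg = lam / (2 * m) * np.sum(theta[1:] ** 2)  # theta_0 not regularized
    return cost + reg
```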


1.3 Neural networks - a classification algorithm consisting of layers of nodes, loosely mirroring the human brain:
- a single neural network layer is exactly logistic regression, so a neural network is a composite classifier and can solve more complex problems,
- requires initializing the weights with random values to avoid symmetry,
- requires calculating forward and back propagation (an expensive operation).


An example is a neural network with 3 layers - 2 input nodes, a 3-node hidden layer, 2 nodes in the output layer, and 2 bias nodes.
The function calculating the output of a node is:
$$ h_{\theta}(x) = \frac{1}{1+e^{-{\theta}^Tx}}$$

Using θ^(j-1) (a matrix of weights controlling the mapping from layer j-1 to layer j), the activation of node i in layer j is calculated from the outputs of the m nodes of the previous layer:

$$a_i^{(j)} = g(\sum_{k=0}^{m}\theta_{ik}^{(j-1)}*a_{k}^{(j-1)}) $$
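
A minimal forward propagation sketch of this computation (the list-of-matrices representation and the names `thetas`, `a` are my own assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, thetas):
    """Propagate a single example x through the network.

    `thetas` is assumed to be a list of weight matrices, one per layer transition;
    a bias unit (value 1) is prepended to the input of every layer.
    """
    a = x
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # add bias node a_0 = 1
        a = sigmoid(theta @ a)          # a_i^(j) = g(sum_k theta_ik^(j-1) * a_k^(j-1))
    return a                            # output layer activations, h_theta(x)
```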

In the last layer the activation is simply hθ(x). Then the cost function with a regularization term is calculated as below:

$$J({\Theta}) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^K y_k^{(i)}\log\left(h_{\Theta}(x^{(i)})_k\right)+(1-y_k^{(i)})\log\left(1-h_{\Theta}(x^{(i)})_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2 $$

The cost function is minimized by iteratively improving the θ values. For neural networks, the error terms must be calculated. The following equations compute them; for the last layer:

$$ \delta = h_{\theta}(x)-y $$
for layers 2...L-1 (where L is the number of network layers): $$ \delta^{(l)} = ((\theta^{(l)})^T\delta^{(l+1)}).*a^{(l)}.*(1-a^{(l)}) $$
and the accumulated back propagation delta (summed over the m training examples):

 $$ \Delta^{(l)} = \sum_{i=1}^m\delta^{(l+1)}(a^{(l)})^T$$

and the derivative of the cost function (the gradient used to update Θ):

$$ \frac{\partial J(\Theta)}{\partial\Theta_{ij}^{(l)}} = \frac{1}{m}(\Delta_{ij}^{(l)}+\lambda\Theta_{ij}^{(l)}) $$

The regularization term is omitted for the bias weights (j = 0). The numerical approximation of the derivative below is very expensive, so it is used only as a check of the back propagation gradients (gradient checking):

$$ \frac{\partial J(\Theta)}{\partial\Theta_{i}} \approx \frac{J(\Theta_1,...,\Theta_i+\epsilon,...,\Theta_n)-J(\Theta_1,...,\Theta_i-\epsilon,...,\Theta_n)}{2\epsilon} $$
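
A small sketch of this gradient check (the cost function `J` is passed in; names are my own):

```python
import numpy as np

def numerical_gradient(J, theta, epsilon=1e-4):
    """Two-sided numerical approximation of dJ/dtheta, as in the formula above.

    Used only to verify back propagation gradients, since it is very slow.
    """
    grad = np.zeros(len(theta))
    for i in range(len(theta)):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += epsilon
        minus[i] -= epsilon
        grad[i] = (J(plus) - J(minus)) / (2 * epsilon)
    return grad
```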

1.4 SVMs (Support Vector Machines, also called Large Margin Classifiers) - an algorithm separating two or more classes with as wide a margin between them as possible.
 
 
With a Gaussian kernel, the classification of each example is based on a similarity function between the example and landmarks l^(i):
$$ f_i = similarity(x, l^{(i)}) = \exp\left(-\frac{\lVert x - l^{(i)} \rVert^2}{2\sigma^2}\right) $$
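
A one-function NumPy sketch of this similarity feature (names `gaussian_kernel`, `l`, `sigma` are my own):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """Similarity feature f between example x and landmark l, as in the formula above."""
    diff = x - l
    return np.exp(-(diff @ diff) / (2 * sigma ** 2))
```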
There are several kernel choices for an SVM; the more important ones are listed below (a short library example follows the objective formula):
1.4.1 No kernel ("linear kernel"): used when there are many features but little training data. This approach is similar to logistic regression.
1.4.2 Polynomial kernel: used when there is a significant amount of training data.
1.4.3 Gaussian kernel: used when there are few features but a significant amount of training data. The optimization objective is:
$$ \min_\Theta C \sum_{i=1}^{m}\left[y^{(i)}cost_1(\Theta^Tf^{(i)})+(1-y^{(i)})cost_0(\Theta^Tf^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n}\Theta_j^2 $$
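
In practice an existing library is normally used instead of implementing the objective directly. A possible illustration with scikit-learn (not covered in the course; the parameter values are arbitrary):

```python
from sklearn.svm import SVC

# Linear kernel: many features, few training examples.
linear_svm = SVC(kernel="linear", C=1.0)

# Gaussian (RBF) kernel: few features, many training examples.
# scikit-learn's gamma corresponds to 1 / (2 * sigma^2) in the formula above.
sigma = 1.0
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2 * sigma ** 2))
```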


2. Unsupervised Learning - a group of algorithms that look for similarities in the data and aggregate the data into a defined number of classes.
- if the number of classes is not imposed, it should be chosen based on the cost function of the trained algorithm (the elbow method).
2.1 K-means - K centroids are randomly initialized from the training set. Then each data point is assigned to the centroid that gives the lowest cost. Iteratively, each centroid is moved to the mean of its class to lower the cost within each class.
 
 
 
This kind of algorithm is used to partition data or to assign product dimensions to groups, e.g. dress sizes (S, M, L). The cost (distortion) function is:
$$J(c^{(1)},...,c^{(m)}, \mu_1,...,\mu_K)= \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} - \mu_{c^{(i)}} \rVert ^2$$
where m is the number of training examples and K is the number of centroids (number of classes).
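
A minimal K-means sketch of the procedure described above (names and the fixed iteration count are my own choices):

```python
import numpy as np

def kmeans(X, K, iterations=100):
    """X : (m, n) data matrix; centroids start as random training examples."""
    m = X.shape[0]
    centroids = X[np.random.choice(m, K, replace=False)].astype(float)
    for _ in range(iterations):
        # assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # move each centroid to the mean of the points assigned to it
        for k in range(K):
            if np.any(c == k):
                centroids[k] = X[c == k].mean(axis=0)
    return centroids, c
```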


2.2 PCA (Principal Component Analysis) - dimensionality reduction used for data compression or to reduce data for visualization. The algorithm removes one or more dimensions from each example.


The covariance matrix is calculated by: $$ \Sigma= \frac{1}{m}\sum_{i=1}^m (x^{(i)})(x^{(i)})^T $$
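
A short sketch of PCA via SVD of this covariance matrix (names `pca`, `k`, `Z`, `X_approx` are my own; mean normalization is assumed):

```python
import numpy as np

def pca(X, k):
    """Project mean-normalized data onto the first k principal components."""
    m = X.shape[0]
    mean = X.mean(axis=0)
    X_norm = X - mean                      # mean normalization
    Sigma = (X_norm.T @ X_norm) / m        # covariance matrix, same as the sum above
    U, S, _ = np.linalg.svd(Sigma)
    Z = X_norm @ U[:, :k]                  # reduced representation
    X_approx = Z @ U[:, :k].T + mean       # approximate reconstruction
    return Z, X_approx
```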
2.3 Anomaly detection - an algorithm used to detect anomalies in data. It could be replaced with supervised learning, but it is preferred when there is a huge amount of normal data and few or no examples of anomalies. It is used to detect anomalies in engines, CPU load, etc.

$$ P(x;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(x-\mu)^2}{2\sigma^2}) $$

It is based on the Gaussian distribution, so an anomaly is flagged if P(x) < ε, where ε is a chosen threshold.
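
A small sketch of this check with per-feature Gaussians (names `fit_gaussian`, `is_anomaly`, `sigma2` are my own):

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-feature mean and variance from normal (non-anomalous) data."""
    return X.mean(axis=0), X.var(axis=0)

def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as an anomaly if the product of per-feature densities P(x) < epsilon."""
    p = np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2))
    return p < epsilon
```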



3. Other ideas
3.1 Recommender Systems - algorithms used by video streaming portals, social media and online stores to suggest other films, friends or products that may interest the customer. The problem could be solved with linear regression, but it is subjective how strongly an item belongs to a category and how much someone likes a specific characteristic of a product. Usually the system has only little information about the customer, or does not know his preferences at all. That is why the collaborative filtering algorithm is used.
The goal is to minimize the cost function:
$$ J(x, \theta)= \frac{1}{2} \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(x_k^{(i)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2$$ where n_u is the number of customers, n_m is the number of products, r(i,j)=1 flags whether customer j rated product i, and y^(i,j) is the value of the customer's rating.
3.2 Online learning - a setting where there is no limit on the input data. The algorithm is constantly learning and improving its predictions as new examples arrive. This requires using a proper learning rate α.
$$ \theta_j=\theta_j - \alpha(h_{\theta}(x)-y)x_j $$
Data can also be processed in parallel in batches. This can be achieved with the MapReduce approach.
Batch gradient descent: $$ \theta_j=\theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} $$ where m is the number of examples in the batch. Parts of the sum are calculated in parallel and then combined into one update.
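
A sketch of one such batch update with the sum split into chunks, mimicking the MapReduce idea (in a real setup each chunk would run on a different machine; here they are computed one after another, and all names are my own):

```python
import numpy as np

def batch_gradient_update(X, y, theta, alpha, workers=4):
    """One batch gradient descent step for linear regression, split into chunks."""
    m = len(y)
    partials = [Xc.T @ (Xc @ theta - yc)                 # "map": partial sums
                for Xc, yc in zip(np.array_split(X, workers),
                                  np.array_split(y, workers))]
    grad = sum(partials) / m                             # "reduce": combine the sums
    return theta - alpha * grad
```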

This is a nutshell summary of the algorithms presented in Andrew's course. More tips and ideas will follow in the next article.

