Thursday, 27 August 2020

This week 4/2020 - Machine Learning - part II

This article is a continuation of the Machine Learning series. I am presenting a few pieces of advice given by Andrew Ng in his Coursera course. They are useful when building a Machine Learning System (MLS). This article covers:
  • how to prepare a data set,
  • how to debug an MLS,
  • what skewed classes are,
  • how to carry out ceiling analysis.
 
Preparing a data set:
For a small data set (up to roughly 100 000 records) it is recommended to shuffle the data and split it in the following proportions:
  • 60% - training records - used to train the algorithm, i.e. to find the θ parameters giving the lowest cost.
  • 20% - cross-validation records - used to select the best configuration of the algorithm, e.g. for a Neural Network (NN) to check how many layers the network should have, or to drop useless features.
  • 20% - test records - used to measure the final performance of the MLS.
For a large data set (above 100 000 records) it is recommended to change the proportions to 92%/4%/4% respectively.
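The split above can be sketched in a few lines of Python. This is a minimal sketch assuming the records fit in memory; `split_data` is a hypothetical helper, not part of the course material:

```python
import random

def split_data(records, seed=0):
    # Shuffle deterministically, then split 60/20/20 into
    # training, cross-validation, and test sets.
    records = list(records)
    random.Random(seed).shuffle(records)
    n = len(records)
    n_train = int(0.6 * n)
    n_cv = int(0.2 * n)
    train = records[:n_train]
    cv = records[n_train:n_train + n_cv]
    test = records[n_train + n_cv:]
    return train, cv, test

train, cv, test = split_data(range(1000))
print(len(train), len(cv), len(test))  # 600 200 200
```

For a data set above 100 000 records only the three proportions would change (e.g. 0.92/0.04/0.04); the shuffling step stays the same.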

Debugging an MLS:
To improve an MLS it is good to perform error analysis; based on its results, consider:
- using more training examples,
- changing the set of features (fewer/more/different ones),
- adding polynomial features,
- changing the λ value of the regularization term,
- changing the number of nodes or layers (for an NN).

Size of the training set - below I added a chart showing the dependency between the cost function and the number of records used in the training set (the learning curve).


 
On the left chart it can be noticed that with high bias, adding more data does not decrease the high error. However, with a complicated (high-variance) function a large gap between the training and cross-validation errors can be observed, and as more data is added the cross-validation error slowly decreases.
Bias and variance can be manipulated by changing the set of features (fewer/more). Below I added a chart showing the dependency between the cost (error) and the complexity of the hypothesis function, together with example functions for one data set.
 
 
 
 
How exactly is this done? First the function is trained on the training data, and then the cross-validation error is calculated for a few feature configurations.
When high bias is observed, it can mean that the hypothesis function is too simple for the prepared data set. It may be necessary to add new features or to create polynomial features from the existing ones.
When high variance is observed, it can mean that the hypothesis function is too complex. It may be necessary to remove some features.
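The two diagnoses above can be turned into a rough rule of thumb. This is a sketch with assumed thresholds (the 10% gap and the 2× factor are my illustrative choices, not values from the course):

```python
def diagnose(train_error, cv_error, target_error):
    # High bias: both errors are high and close to each other.
    if train_error > target_error and cv_error - train_error < 0.1 * cv_error:
        return "high bias: add features / polynomial features"
    # High variance: low training error but a large gap to the CV error.
    if train_error <= target_error and cv_error > 2 * train_error:
        return "high variance: remove features or add more data"
    return "just right"

print(diagnose(0.50, 0.52, 0.10))  # both errors high and close -> high bias
print(diagnose(0.05, 0.50, 0.10))  # big train/CV gap -> high variance
```

In practice the thresholds depend on the problem; the point is only that the decision uses the pair (training error, cross-validation error), not either number alone.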
 
It is possible to manipulate bias and variance by changing the λ of the regularization term. Below I added 3 charts: for a very big λ, a "just right" λ, and a λ close or equal to 0.


It can be noticed that a too big λ creates an almost constant function. When λ is close to 0, the regularization term is negligibly small and can be skipped.
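The effect of a too big λ can be seen even in the simplest regularized model. Below is a sketch for a one-parameter model y = θx with closed-form regularized least squares; the data and λ grid are assumed example values:

```python
import random

random.seed(2)
# Synthetic data: y = 3x plus noise (assumed example data).
xs = [random.uniform(-1, 1) for _ in range(60)]
ys = [3 * x + random.gauss(0, 0.3) for x in xs]

x_train, y_train = xs[:40], ys[:40]   # training records
x_cv, y_cv = xs[40:], ys[40:]         # cross-validation records

def fit_theta(x, y, lam):
    # Regularized least squares for y = theta * x:
    # theta = sum(x*y) / (sum(x^2) + lambda)
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def cost(theta, x, y):
    # Unregularized squared-error cost, evaluated on a given set.
    return sum((theta * a - b) ** 2 for a, b in zip(x, y)) / (2 * len(x))

# Sweep lambda from 0 to very large; a huge lambda pushes theta toward 0,
# i.e. an almost constant function, so its cross-validation cost blows up.
cv_cost = {lam: cost(fit_theta(x_train, y_train, lam), x_cv, y_cv)
           for lam in (0.0, 0.1, 1.0, 100.0)}
best_lam = min(cv_cost, key=cv_cost.get)
```

As in the charts, the λ is chosen by the cross-validation cost, not the training cost: the training cost always prefers λ close to 0.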
 

Skewed classes
This term refers to a situation where the data of one category is much larger than the data of the other category, e.g. for a binary output, 99% of the examples belong to the "true" category and only 1% to the "false" category. Then a trained logistic regression algorithm and a trivial system that always returns "true" differ by only about 1% in accuracy - not a big difference in the numbers, even though the systems are significantly different.
That's why, to compare systems like this, the following terms are defined:
- true positive (TP),
- true negative (TN),
- false negative (FN),
- false positive (FP),
described in the draft below:


 
and the following measures:
- precision - the ratio of true positives to all examples predicted as positive:
$$ precision = \frac{TP}{TP + FP} $$

- recall - the ratio of true positives to all examples that are actually positive:

$$ recall = \frac{TP}{TP + FN} $$

Precision (P) and recall (R) are combined into a single measure, the F1 score:
$$ F_1 \text{ score} = 2 \cdot \frac{P \cdot R}{P + R} $$
so a bigger score means a better system.
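These three measures follow directly from the TP/FP/FN counts. A minimal sketch, with the rare class treated as the positive class (here a 1%/99% split, the mirror of the example above), showing how the F1 score exposes a trivial "always positive" classifier that accuracy would rate at 1% error:

```python
def precision_recall_f1(actual, predicted):
    # Count true positives, false positives, and false negatives
    # for binary labels, where 1 marks the positive class.
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Skewed set: 1 positive example among 100, and a classifier
# that always predicts "positive".
actual = [1] + [0] * 99
always_positive = [1] * 100
p, r, f1 = precision_recall_f1(actual, always_positive)
print(p, r, round(f1, 4))  # 0.01 1.0 0.0198
```

The trivial classifier reaches perfect recall but terrible precision, so its F1 score is close to 0 - exactly the kind of difference that plain accuracy hides on skewed classes.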
 
The last term in this article is ceiling analysis - this is more of an economic term, because it looks at the whole system as a set of MLS modules working in a pipeline.
This analysis answers the question of which module should be improved to gain the most accuracy for the whole application.
