This article is a short overview of Docker tools that are commonly used in a microservice architecture:
Docker
Docker Compose
Docker
is a platform for running applications in containers. Containers are created from images that are built incrementally, layer after layer, similarly to commits in a code repository.
A container is an isolated environment for running an application. Unlike a virtual machine, it does not use its own operating system; it shares the kernel with the host. That is why a container starts up in seconds and puts far less load on the physical machine than a virtual machine, which needs minutes to boot. This is also why Docker is commonly used to spin up application instances.
Docker can be used interactively from the console. The most useful commands are listed below:
docker ps - show all running containers
docker images - show images in the local repository
docker run -d [image_name] - run an image in detached (daemon) mode
docker exec -it [container_id] "[command to run in the container, e.g. /bin/sh]" - attach to a specific container and execute a command
docker container logs [container_id] - print logs from a container
docker pull [image_name] - pull an image from an external image repository
The biggest benefit of Docker, however, is that it can be driven by scripts, so the whole process is repeatable and can be automated. The default build file is called Dockerfile. An example:
```dockerfile
# base image for this build
FROM openjdk:8-jdk-alpine
# define which directory should be mounted to the host
# - mounted directories are created in /var/lib/docker/volumes
VOLUME /tmp
# only inform which ports the application can expose
EXPOSE 8080
# define a build-time variable
ARG JAR_FILE=target/*.jar
# define a variable using an environment variable, or "v1.0.0" if not defined.
# ENV overrides the ARG variable. Example execution with a variable:
# $ docker build --build-arg CONT_IMG_VER=v2.0.1 .
ENV SOME_ENV_VAR ${CONT_IMG_VER:-v1.0.0}
# copy a file from the host to the container storage
COPY ${JAR_FILE} app.jar
# copy a file from the host to the container storage; compared to COPY,
# ADD can also fetch a file from a URL and extract tar archives
ADD ${JAR_FILE} app.jar
# run a command in the container
RUN uname -a
# health check command - Docker checks whether the application is working properly
HEALTHCHECK --interval=5m --timeout=3s --retries=5 \
  CMD curl -f http://localhost/ || exit 1
# run the application as the goal of this image
ENTRYPOINT ["java", "-Djava.security.egd=file:/dev/./urandom", "-jar", "/app.jar"]
```
Having a Dockerfile, the image is built with the command:
docker build -t [image_name] .
and then, if the image has to be pushed to a remote repository:
docker push [image_name]
Docker Compose is a tool for standing up several containers based on a docker-compose.yml file. The tool manages dependencies between containers, so a single command can start many services (containers). A few of the most useful commands:
docker-compose build - build the images defined in docker-compose.yml
docker-compose up -d - run containers in detached (daemon) mode
docker-compose down - stop containers
docker-compose logs - print logs from containers
An example docker-compose.yml file:
```yaml
# version of the file format
version: "3.3"
# definition of services (container templates)
services:
  # name of the service
  mongoDB:
    # image name - this image is retrieved from a remote repository
    image: library/mongo:4.4.0
    # container name
    container_name: "mongoDBcontainerName"
    # what to do if the application dies
    restart: on-failure
    # ports which should be exposed to the host (host port : container port)
    ports:
      - 27017:27017
    # images have defined variables; their values are set this way
    environment:
      MONGO_INITDB_ROOT_USERNAME: sboot
      MONGO_INITDB_ROOT_PASSWORD: example
      MONGO_INITDB_DATABASE: test
    # storage mapping (host path : container path : access mode)
    volumes:
      - ./src/main/sql/mongo-init.js:/docker-entrypoint-initdb.d/mongo-init.js:ro
  app:
    # build properties - this service will be built
    build:
      # context path on the host
      context: ./
      # docker file
      dockerfile: Dockerfile
    container_name: "myApp"
    # services this one depends on
    depends_on:
      - mongoDB
    # this defines DNS names in the container for the services it depends on
    links:
      - mongoDB
```
To prepare this article I used:
Docker in version 19.03.6 - provided by the system
Docker Compose in version 1.17.1 - provided by the system
This article is a continuation of the Machine Learning series. I present a few pieces of advice given by Andrew Ng in his Coursera course. They are useful when building a Machine Learning System (MLS). This article covers:
how to prepare data,
how to debug it,
what are skewed classes,
how to carry out ceiling analysis.
Preparing the data set:
For a small data set (up to roughly 10,000-100,000 records) it is recommended to split the randomized data in the following proportions:
60% - training records - used to train the algorithm to find the θ parameters giving the lowest cost.
20% - cross-validation records - used to select the best configuration of the algorithm, e.g. for a Neural Network (NN) to check how many layers the network should have, or to remove useless features.
20% - test records - used to measure the performance of the MLS.
For a big data set (above 100,000 records) it is recommended to change the proportions to 92%/4%/4% respectively. A sketch of such a split follows below.
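A minimal sketch of this split, assuming the examples are already shuffled and held in NumPy arrays (the data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))       # illustrative feature matrix
y = rng.integers(0, 2, size=1000)    # illustrative labels

def split_dataset(X, y, train=0.6, cv=0.2):
    """Split an already shuffled data set into training / cross-validation / test parts."""
    m = X.shape[0]
    i1 = int(m * train)              # end of the training slice
    i2 = int(m * (train + cv))       # end of the cross-validation slice
    return (X[:i1], y[:i1]), (X[i1:i2], y[i1:i2]), (X[i2:], y[i2:])

# 60/20/20 for a small set; for a big set use split_dataset(X, y, 0.92, 0.04)
train_set, cv_set, test_set = split_dataset(X, y)
```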
Debugging MLS:
To improve an MLS it is good to perform error analysis, so consider:
- using more training examples,
- changing the set of features (fewer/more/different),
- adding polynomial features,
- changing the λ value of the regularization term,
- changing the number of nodes or layers (for NN).
Size of the training set - below I added a chart showing the dependency between the cost function and the number of records used in the training set (the learning curve).
On the left chart it can be noticed that for high bias, adding more data does not decrease the high error. However, when the function is complicated (high variance), a huge gap between the training and cross-validation errors can be observed, and as more data is added the cross-validation error slowly decreases.
This can be manipulated by changing the set of features (fewer/more). Below I added a chart showing the dependency between the cost (error) and the complexity of the target function, together with example functions fitted to one data set.
How exactly is this done? First, the function is trained on the training data, and then the cross-validation error is calculated for a few feature configurations.
When high bias is observed, it can mean that the target function is too simple for the prepared data set. It may be necessary to add new features or create polynomial features from the existing ones.
When high variance is observed, it can mean that the target function is too complex. It may be necessary to remove some features. A sketch of this diagnosis follows below.
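A sketch of this procedure, fitting polynomials of increasing degree to synthetic data and comparing training and cross-validation errors (the data and the degrees are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = x**3 - x + rng.normal(scale=0.5, size=200)   # noisy cubic target

x_tr, y_tr = x[:120], y[:120]          # training part
x_cv, y_cv = x[120:160], y[120:160]    # cross-validation part

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, degree)                    # train on training data only
    err_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2) / 2
    err_cv = np.mean((np.polyval(coeffs, x_cv) - y_cv) ** 2) / 2
    # degree 1: both errors high (high bias); degree 9: low training error
    # but a higher cross-validation error (high variance)
    print(degree, err_tr, err_cv)
```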
It is also possible to manipulate bias and variance by changing the λ of the regularization term. Below I added 3 charts: for a very big λ, for a just right λ, and for λ close or equal to 0.
It can be seen that a too big λ creates an almost constant function. When λ is close to 0, the regularization term is negligibly small and can be skipped.
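The same kind of comparison can be done for λ. Below is a sketch using L2-regularized linear regression solved with the regularized normal equation; the data and the λ values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
theta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
y = X @ theta_true + rng.normal(scale=0.3, size=100)

X_tr, y_tr, X_cv, y_cv = X[:60], y[:60], X[60:80], y[60:80]

for lam in (0.0, 1.0, 1000.0):
    n = X_tr.shape[1]
    # regularized normal equation: theta = (X'X + lambda*I)^-1 X'y
    theta = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(n), X_tr.T @ y_tr)
    err_cv = np.mean((X_cv @ theta - y_cv) ** 2) / 2
    # lambda ~ 0 behaves like no regularization; a huge lambda pushes theta
    # towards zero, so the fitted function becomes almost constant
    print(lam, err_cv)
```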
Skewed classes
This term refers to a situation when the data of one category is much larger than that of the other, e.g. for a binary output, when 99% of the examples belong to the "true" category and 1% to the "false" category. Then there is not much difference in accuracy between a trained logistic regression algorithm and a system that always returns "true": at most 1% difference in effectiveness, which does not sound bad, yet the systems are significantly different.
That is why, to compare systems like this, the following terms are defined:
- true positive,
- true negative,
- false negative,
- false positive
as described in the diagram below:
and measures:
- precision - the ratio of true positives to all predicted positives (TP + FP)
$$ precision = \frac{TP}{TP + FP} $$
- recall - the ratio of true positives to all actual positives (TP + FN)
$$recall = \frac{TP}{TP+FN} $$
These are combined, using precision (P) and recall (R), into a single measure, the F1 score:
$$ F_1\ score = 2\cdot\frac{P \cdot R}{P+R} $$
A bigger score means a better system.
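A minimal sketch computing these measures from predicted and actual labels (the arrays are illustrative):

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 1, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```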
The last term in this article is ceiling analysis. This is a more economic term, because it focuses on the whole system as a set of MLS modules working in a pipeline.
This analysis answers the question of which module should be improved to get higher accuracy of the application.
1.1 Linear regression - the algorithm adapts the parameters of an equation to approximate the training data and get the lowest cost.
The course presented two methods to achieve that:
1.1.1 Gradient descent - an iterative method; in each iteration the cost function should get closer to a local minimum. The main requirements and properties (a sketch follows this list):
- requires choosing α - if too big, the cost increases; if too small, the number of steps to reach the minimum of the cost function increases,
- needs many iterations,
- recommended for a large number of features.
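A minimal sketch of gradient descent for linear regression on synthetic data; α, the number of iterations, and the generated data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 100
X = np.column_stack([np.ones(m), rng.uniform(0, 10, m)])   # bias column + one feature
y = 4.0 + 2.5 * X[:, 1] + rng.normal(scale=1.0, size=m)

theta = np.zeros(2)
alpha = 0.01                              # too big -> cost grows, too small -> many steps
for _ in range(5000):
    grad = X.T @ (X @ theta - y) / m      # gradient of the squared-error cost
    theta -= alpha * grad
print(theta)                              # should approach [4.0, 2.5]
```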
1.1.2 Normal equation - a non-iterative way to find θ. The main features of this algorithm (a sketch follows this list):
- no α factor,
- does not iterate to find the minimum of the cost function,
- requires calculating (X^T X)^-1, which gives complexity O(n^3), so it is slow for a large number of features,
- can run into problems with matrix inversion (some additional operations may be required, e.g. removing redundant features or using regularization).
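For comparison, the same fit with the normal equation; np.linalg.pinv is used here so that a singular X^T X does not break the calculation:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 100
X = np.column_stack([np.ones(m), rng.uniform(0, 10, m)])
y = 4.0 + 2.5 * X[:, 1] + rng.normal(scale=1.0, size=m)

# theta = (X'X)^-1 X'y; pinv instead of inv survives a non-invertible X'X
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
```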
1.2 Logistic regression - a classification algorithm that gives a binary output.
For more than 2 classes (n classes) an n-functions (one-vs-all) algorithm is used, and then, to get the most probable class, the function with the highest output probability is chosen. Encountered problems:
- choosing the correct decision boundary,
- additional optimization algorithms (conjugate gradient, BFGS, L-BFGS) - usually faster but more complex.
The goal is to minimize the cost function. For multi-class classification the algorithm picks the class whose hypothesis h gives the highest value (a sketch follows below).
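A sketch of the one-vs-all prediction step, assuming an already trained parameter matrix Theta with one row of θ values per class (all numbers here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# illustrative parameters: 3 classes, 2 features + bias term
Theta = np.array([[ 1.0,  2.0, -1.0],
                  [-0.5, -1.0,  2.0],
                  [ 0.2,  0.5,  0.5]])

x = np.array([1.0, 0.7, -1.2])          # one example, with a leading bias of 1
h = sigmoid(Theta @ x)                  # probability-like output of each classifier
predicted_class = int(np.argmax(h))     # pick the class with the highest h
print(h, predicted_class)
```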
1.3 Neural networks - a classification algorithm consisting of layers of nodes, loosely reflecting the human brain:
- one neural network layer is exactly logistic regression, so a neural network is a composite classifier and can solve more complex problems,
- requires initialization of weights with random values to avoid symmetry,
- requires calculation of forward and back propagation (an expensive operation).
Below is an example of a neural network with 3 layers: 2 input nodes, a 3-node hidden layer, 2 nodes in the output layer, and 2 bias nodes.
The function calculating the output of a node is:
$$ h_{\theta}(x) = \frac{1}{1+e^{-{\theta}^Tx}}$$
Using θ^(j-1) (the matrix of weights controlling the function mapping from layer j-1 to layer j), the activation of node i in layer j is calculated from the outputs of the nodes m of the previous layer:
The cost function is minimized by iteratively improving the θ values. For Neural Networks it is also required to calculate the error terms. The following equations are used: for the last layer:
$$ \delta = h_{\theta}(x)-y $$
for layers 1...L-1 (where L is the number of network layers):
$$ \delta^{(l)} = ((\theta^{(l)})^T\delta^{(l+1)}) .* a^{(l)} .* (1-a^{(l)}) $$
and back propagation delta:
$$ \Delta_n = \sum_{i=1}^m\delta_n^i*a_{n-1}$$
and the derivative of the cost function (the gradient used for adaptation):
The regularization factor is removed for the first layer. The numerical approximation of the derivative (gradient checking) is very expensive, so it should be used only to confirm the gradients calculated by back propagation:
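A sketch of such a numerical gradient check on a simple quadratic cost; the cost function and ε are illustrative, and in practice the analytic gradient would come from back propagation:

```python
import numpy as np

def cost(theta):
    # illustrative cost function with a known analytic gradient
    return np.sum(theta ** 2) / 2

def analytic_grad(theta):
    return theta                      # would normally be produced by back propagation

def numeric_grad(theta, eps=1e-4):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        # two-sided approximation of the partial derivative
        grad[i] = (cost(theta + e) - cost(theta - e)) / (2 * eps)
    return grad

theta = np.array([1.0, -2.0, 0.5])
print(np.max(np.abs(analytic_grad(theta) - numeric_grad(theta))))  # should be ~0
```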
1.4 SVM (Support Vector Machine) - there are many methods to create an SVM; below only the more important ones:
1.4.1 No kernel ("linear kernel") - used when there are many features but not much training data. This algorithm is similar to logistic regression.
1.4.2 Polynomial kernel - used when there is a significant amount of training data.
1.4.3 Gaussian kernel - used when there are not many features but a significant amount of training data.
The cost function to minimize is:
$$ \min_\Theta C \sum_{i=1}^{m}\left[y^{(i)}cost_1(\Theta^Tf^{(i)})+(1-y^{(i)})cost_0(\Theta^Tf^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n}\Theta_j^2 $$
2. Unsupervised Learning - a group of algorithms looking for similarities in the data and aggregating it into a defined number of classes.
- if the number of classes is not forced, it should be chosen on the basis of the cost function of the trained algorithm (the elbow method).
2.1 K-means - K centroids are randomly initialized from the training set. Then each data point is assigned to the centroid for which the cost is lowest. Iteratively, each centroid is moved to the mean of its class to get the lowest cost within that class.
This kind of algorithm is used for partitioning data or assigning product dimensions to groups, e.g. dress sizes (S, M, L). The minimized cost function is (a sketch follows below it):
$$J(c^{(1)},...,c^{(m)}, \mu_1,...,\mu_K)= \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} - \mu_{c^{(i)}} \rVert ^2$$
where m is the number of training examples and K is the number of centroids (number of classes).
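A compact K-means sketch on synthetic 2-D data; K and the iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
K = 2

centroids = X[rng.choice(len(X), K, replace=False)]   # random init from the training set
for _ in range(10):
    # assign each point to the closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    c = np.argmin(dists, axis=1)
    # move each centroid to the mean of its class
    centroids = np.array([X[c == k].mean(axis=0) for k in range(K)])

cost = np.mean(np.linalg.norm(X - centroids[c], axis=1) ** 2)
print(centroids, cost)
```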
2.2 PCA (Principal Component Analysis) - dimension reduction used for data compression or to reduce data for visualization. The algorithm removes one or more dimensions from each example.
The covariance matrix is calculated as:
$$ \Sigma= \frac{1}{m}\sum_{i=1}^m (x^{(i)})(x^{(i)})^T $$
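A sketch of PCA reducing illustrative data from 3 dimensions to 2, using the covariance matrix above and SVD:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)                  # feature centering (mean normalization)

Sigma = (X.T @ X) / X.shape[0]          # covariance matrix, as in the formula above
U, S, _ = np.linalg.svd(Sigma)          # columns of U are the principal directions

k = 2                                   # target number of dimensions
Z = X @ U[:, :k]                        # projected (compressed) data
X_approx = Z @ U[:, :k].T               # approximate reconstruction
print(Z.shape, X_approx.shape)
```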
2.3 Anomaly detection - an algorithm used to detect anomalies in data. It could be replaced with supervised learning algorithms, but it is used when there is a huge amount of correct data and few or no cases showing anomalies. It is used, for example, to detect anomalies in engines, CPU load, etc.
It is based on the Gaussian distribution, so an anomaly is detected if P(x) < ε, where ε is a defined threshold.
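A sketch of Gaussian anomaly detection on a single illustrative feature; the threshold ε is an assumption and would normally be chosen on a cross-validation set:

```python
import numpy as np

rng = np.random.default_rng(7)
x_train = rng.normal(loc=50.0, scale=5.0, size=1000)   # mostly correct data, e.g. CPU load

mu = x_train.mean()
sigma2 = x_train.var()

def p(x):
    # Gaussian density with the estimated parameters
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

eps = 1e-4                       # assumed threshold
x_new = np.array([51.0, 95.0])
print(p(x_new) < eps)            # [False, True] -> the second value is flagged as an anomaly
```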
3. Other ideas
3.1 Recommender Systems - algorithms used by video streaming portals, social media and store portals to suggest other films, friends or products which could interest a customer. The problem could be solved by linear regression, but it is subjective how strongly something belongs to a given category and how much someone likes a specific characteristic of a product. Usually the system has only a little information about the customer, or does not know his preferences at all. That is why the collaborative filtering algorithm is used.
The goal is to minimize the cost function:
$$ J(x, \theta)= \frac{1}{2} \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(x_k^{(i)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2$$
where n_u is the number of customers, n_m is the number of products, r(i,j)=1 is a flag indicating that customer j rated product i, and y^(i,j) is the value of the customer's rating.
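A sketch evaluating this cost for illustrative ratings, feature and parameter matrices (all values are made up):

```python
import numpy as np

rng = np.random.default_rng(8)
n_m, n_u, n = 5, 4, 3                                     # products, customers, features
Y = rng.integers(1, 6, size=(n_m, n_u)).astype(float)     # ratings 1..5
R = (rng.random((n_m, n_u)) > 0.3).astype(float)          # 1 where a customer rated a product
X = rng.normal(size=(n_m, n))                             # product features (learned)
Theta = rng.normal(size=(n_u, n))                         # customer preferences (learned)
lam = 1.0

err = (X @ Theta.T - Y) * R               # errors counted only where ratings exist
J = 0.5 * np.sum(err ** 2) \
    + lam / 2 * np.sum(X ** 2) \
    + lam / 2 * np.sum(Theta ** 2)
print(J)
```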
3.2 Online learning - a system where there is no limit on input data. The algorithm is constantly learning and improving its predictions. This requires using a proper α.
Data can also be processed in parallel in batches. This can be achieved with the MapReduce algorithm.
Batch gradient descent:
$$ \theta_j=\theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} $$
where m is the number of records in the batch. Each partial sum is calculated in parallel and then combined into one update (a sketch follows below).
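A sketch of splitting this sum over a batch into chunks whose partial gradients are computed independently (the "map" step) and then combined (the "reduce" step); the chunking into 4 parts stands in for 4 workers and is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(9)
m = 400
X = np.column_stack([np.ones(m), rng.normal(size=m)])
y = 1.0 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=m)
theta = np.zeros(2)
alpha = 0.1

def partial_gradient(X_part, y_part, theta):
    # "map" step: the part of the gradient sum for one chunk of the batch
    return X_part.T @ (X_part @ theta - y_part)

for _ in range(200):
    chunks = zip(np.array_split(X, 4), np.array_split(y, 4))               # e.g. 4 workers
    total = sum(partial_gradient(Xc, yc, theta) for Xc, yc in chunks)      # "reduce" step
    theta -= alpha * total / m
print(theta)    # should approach [1.0, 3.0]
```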
This is the presented algorithms from Andrew's course in a nutshell. More tips and ideas will be presented in the next article.