This article is a short overview of Docker tools that are commonly used in a microservice architecture:
Docker
Docker Compose
Docker
is a platform for running applications in containers. Containers are created from images that are built incrementally, layer after layer, similarly to commits in a code repository.
A container is an isolated environment for running an application. Unlike a virtual machine, it does not use its own operating system; it shares the kernel with the host. That is why a container starts up in seconds and puts far less load on the physical machine than a virtual machine, which needs minutes to boot. This is also why Docker is commonly used to spin up application instances.
Docker can be used interactively from the console. The most useful commands are listed below:
docker ps - show all running containers
docker images - show images in the local repository
docker run -d [image_name] - run an image in detached (daemon) mode
docker exec -it [container_id] "[command to run in the container, e.g. /bin/sh]" - attach to a specific container and execute a command
docker container logs [container_id] - print logs from a container
docker pull [image_name] - pull an image from an external image repository
The biggest benefit of Docker, however, is that it can be driven by scripts, so the whole process is repeatable and can be automated. The default build file is called Dockerfile. An example:
```dockerfile
# base image for this build
FROM openjdk:8-jdk-alpine
# define which directory should be mounted to the host
# - mounted directories are created in /var/lib/docker/volumes
VOLUME /tmp
# only inform which ports the application can expose
EXPOSE 8080
# define a build-time variable
ARG JAR_FILE=target/*.jar
# define a variable using an environment variable, or "v1.0.0" if not defined.
# ENV overrides the ARG variable. Example execution with a variable:
# $ docker build --build-arg CONT_IMG_VER=v2.0.1 .
ENV SOME_ENV_VAR ${CONT_IMG_VER:-v1.0.0}
# copy a file from the host to the container storage
COPY ${JAR_FILE} app.jar
# copy a file from the host to the container storage; compared to COPY,
# ADD can also fetch a file from a URL and extract tar archives
ADD ${JAR_FILE} app.jar
# run a command in the container
RUN uname -a
# health check command - Docker checks whether the application is working properly
HEALTHCHECK --interval=5m --timeout=3s --retries=5 \
  CMD curl -f http://localhost/ || exit 1
# run the application as the goal of this image
ENTRYPOINT ["java", "-Djava.security.egd=file:/dev/./urandom", "-jar", "/app.jar"]
```
Having a Dockerfile, the image is built with the command:
docker build -t [image_name] .
and then, if the image has to be pushed to a remote repository:
docker push [image_name]
Docker Compose is a tool for standing up several containers based on a docker-compose.yml file. The tool manages dependencies between containers, so a single command can start many services (containers). A few of the most useful commands:
docker-compose build - build the images defined in docker-compose.yml
docker-compose up -d - run containers in detached (daemon) mode
docker-compose down - stop containers
docker-compose logs - print logs from containers
An example docker-compose.yml file:
```yaml
# version of the file format
version: "3.3"
# definition of services (container templates)
services:
  # name of the service
  mongoDB:
    # image name - this image is retrieved from a remote repository
    image: library/mongo:4.4.0
    # container name
    container_name: "mongoDBcontainerName"
    # what to do if the application dies
    restart: on-failure
    # ports which should be exposed to the host (host port : container port)
    ports:
      - 27017:27017
    # images have defined variables; their values are set this way
    environment:
      MONGO_INITDB_ROOT_USERNAME: sboot
      MONGO_INITDB_ROOT_PASSWORD: example
      MONGO_INITDB_DATABASE: test
    # storage mapping (host path : container path : access mode)
    volumes:
      - ./src/main/sql/mongo-init.js:/docker-entrypoint-initdb.d/mongo-init.js:ro
  app:
    # build properties - this service will be built
    build:
      # context path on the host
      context: ./
      # docker file
      dockerfile: Dockerfile
    container_name: "myApp"
    # services this one depends on
    depends_on:
      - mongoDB
    # this defines DNS names in the container for the services it depends on
    links:
      - mongoDB
```
To prepare this article I used:
Docker in version 19.03.6 - provided by the system
Docker Compose in version 1.17.1 - provided by the system
This article is a continuation of the Machine Learning series. I present a few pieces of advice given by Andrew Ng in his Coursera course. They are useful when building a Machine Learning System (MLS). This article covers:
how to prepare data,
how to debug it,
what are skewed classes,
how to carry out ceiling analysis.
Preparing the data set:
For a small data set (up to roughly 10,000-100,000 records) it is recommended to split the randomized data in the following proportions:
60% - training records - used to train the algorithm to find the θ parameters giving the lowest cost.
20% - cross-validation records - used to select the best configuration of the algorithm, e.g. for a Neural Network (NN) to check how many layers the network should have, or to remove useless features.
20% - test records - used to measure the performance of the MLS.
For a big data set (above 100,000 records) it is recommended to change the proportions to 92%/4%/4% respectively. A sketch of such a split follows below.
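A minimal sketch of this split, assuming the examples are already shuffled and held in NumPy arrays (the data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))       # illustrative feature matrix
y = rng.integers(0, 2, size=1000)    # illustrative labels

def split_dataset(X, y, train=0.6, cv=0.2):
    """Split an already shuffled data set into training / cross-validation / test parts."""
    m = X.shape[0]
    i1 = int(m * train)              # end of the training slice
    i2 = int(m * (train + cv))       # end of the cross-validation slice
    return (X[:i1], y[:i1]), (X[i1:i2], y[i1:i2]), (X[i2:], y[i2:])

# 60/20/20 for a small set; for a big set use split_dataset(X, y, 0.92, 0.04)
train_set, cv_set, test_set = split_dataset(X, y)
```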
Debugging MLS:
To improve an MLS it is good to perform error analysis, so consider:
- using more training examples,
- changing the set of features (fewer/more/different),
- adding polynomial features,
- changing the λ value of the regularization term,
- changing the number of nodes or layers (for NN).
Size of the training set - below I added a chart showing the dependency between the cost function and the number of records used in the training set (the learning curve).
On the left chart it can be noticed that for high bias, adding more data does not decrease the high error. However, when the function is complicated (high variance), a huge gap between the training and cross-validation errors can be observed, and as more data is added the cross-validation error slowly decreases.
This can be manipulated by changing the set of features (fewer/more). Below I added a chart showing the dependency between the cost (error) and the complexity of the target function, together with example functions fitted to one data set.
How exactly is this done? First, the function is trained on the training data, and then the cross-validation error is calculated for a few feature configurations.
When high bias is observed, it can mean that the target function is too simple for the prepared data set. It may be necessary to add new features or create polynomial features from the existing ones.
When high variance is observed, it can mean that the target function is too complex. It may be necessary to remove some features. A sketch of this diagnosis follows below.
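A sketch of this procedure, fitting polynomials of increasing degree to synthetic data and comparing training and cross-validation errors (the data and the degrees are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = x**3 - x + rng.normal(scale=0.5, size=200)   # noisy cubic target

x_tr, y_tr = x[:120], y[:120]          # training part
x_cv, y_cv = x[120:160], y[120:160]    # cross-validation part

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, degree)                    # train on training data only
    err_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2) / 2
    err_cv = np.mean((np.polyval(coeffs, x_cv) - y_cv) ** 2) / 2
    # degree 1: both errors high (high bias); degree 9: low training error
    # but a higher cross-validation error (high variance)
    print(degree, err_tr, err_cv)
```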
It is also possible to manipulate bias and variance by changing the λ of the regularization term. Below I added 3 charts: for a very big λ, for a just right λ, and for λ close or equal to 0.
It can be seen that a too big λ creates an almost constant function. When λ is close to 0, the regularization term is negligibly small and can be skipped.
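The same kind of comparison can be done for λ. Below is a sketch using L2-regularized linear regression solved with the regularized normal equation; the data and the λ values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
theta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
y = X @ theta_true + rng.normal(scale=0.3, size=100)

X_tr, y_tr, X_cv, y_cv = X[:60], y[:60], X[60:80], y[60:80]

for lam in (0.0, 1.0, 1000.0):
    n = X_tr.shape[1]
    # regularized normal equation: theta = (X'X + lambda*I)^-1 X'y
    theta = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(n), X_tr.T @ y_tr)
    err_cv = np.mean((X_cv @ theta - y_cv) ** 2) / 2
    # lambda ~ 0 behaves like no regularization; a huge lambda pushes theta
    # towards zero, so the fitted function becomes almost constant
    print(lam, err_cv)
```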
Skewed classes
This term refers to a situation when the data of one category is much larger than that of the other, e.g. for a binary output, when 99% of the examples belong to the "true" category and 1% to the "false" category. Then there is not much difference in accuracy between a trained logistic regression algorithm and a system that always returns "true": at most 1% difference in effectiveness, which does not sound bad, yet the systems are significantly different.
That is why, to compare systems like this, the following terms are defined:
- true positive,
- true negative,
- false negative,
- false positive
as described in the diagram below:
and measures:
- precision - the ratio of true positives to all predicted positives (TP + FP)
$$ precision = \frac{TP}{TP + FP} $$
- recall - the ratio of true positives to all actual positives (TP + FN)
$$recall = \frac{TP}{TP+FN} $$
These are combined, using precision (P) and recall (R), into a single measure, the F1 score:
$$ F_1\ score = 2\cdot\frac{P \cdot R}{P+R} $$
A bigger score means a better system.
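A minimal sketch computing these measures from predicted and actual labels (the arrays are illustrative):

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 1, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```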
The last term in this article is ceiling analysis. This is a more economic term, because it focuses on the whole system as a set of MLS modules working in a pipeline.
This analysis answers the question of which module should be improved to get higher accuracy of the application.
1.1 Linear regression - the algorithm adapts the parameters of an equation to approximate the training data and get the lowest cost.
The course presented two methods to achieve that:
1.1.1 Gradient descent - an iterative method; in each iteration the cost function should get closer to a local minimum. The main requirements and properties (a sketch follows this list):
- requires choosing α - if too big, the cost increases; if too small, the number of steps to reach the minimum of the cost function increases,
- needs many iterations,
- recommended for a large number of features.
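A minimal sketch of gradient descent for linear regression on synthetic data; α, the number of iterations, and the generated data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 100
X = np.column_stack([np.ones(m), rng.uniform(0, 10, m)])   # bias column + one feature
y = 4.0 + 2.5 * X[:, 1] + rng.normal(scale=1.0, size=m)

theta = np.zeros(2)
alpha = 0.01                              # too big -> cost grows, too small -> many steps
for _ in range(5000):
    grad = X.T @ (X @ theta - y) / m      # gradient of the squared-error cost
    theta -= alpha * grad
print(theta)                              # should approach [4.0, 2.5]
```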
1.1.2 Normal equation - a non-iterative way to find θ. The main features of this algorithm (a sketch follows this list):
- no α factor,
- does not iterate to find the minimum of the cost function,
- requires calculating (X^T X)^-1, which gives complexity O(n^3), so it is slow for a large number of features,
- can run into problems with matrix inversion (some additional operations may be required, e.g. removing redundant features or using regularization).
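For comparison, the same fit with the normal equation; np.linalg.pinv is used here so that a singular X^T X does not break the calculation:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 100
X = np.column_stack([np.ones(m), rng.uniform(0, 10, m)])
y = 4.0 + 2.5 * X[:, 1] + rng.normal(scale=1.0, size=m)

# theta = (X'X)^-1 X'y; pinv instead of inv survives a non-invertible X'X
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
```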
1.2 Logistic regression - a classification algorithm that gives a binary output.
For more than 2 classes (n classes) an n-functions (one-vs-all) algorithm is used, and then, to get the most probable class, the function with the highest output probability is chosen. Encountered problems:
- choosing the correct decision boundary,
- additional optimization algorithms (conjugate gradient, BFGS, L-BFGS) - usually faster but more complex.
The goal is to minimize the cost function. For multi-class classification the algorithm picks the class whose hypothesis h gives the highest value (a sketch follows below).
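A sketch of the one-vs-all prediction step, assuming an already trained parameter matrix Theta with one row of θ values per class (all numbers here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# illustrative parameters: 3 classes, 2 features + bias term
Theta = np.array([[ 1.0,  2.0, -1.0],
                  [-0.5, -1.0,  2.0],
                  [ 0.2,  0.5,  0.5]])

x = np.array([1.0, 0.7, -1.2])          # one example, with a leading bias of 1
h = sigmoid(Theta @ x)                  # probability-like output of each classifier
predicted_class = int(np.argmax(h))     # pick the class with the highest h
print(h, predicted_class)
```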
1.3 Neural networks - a classification algorithm consisting of layers of nodes, loosely reflecting the human brain:
- one neural network layer is exactly logistic regression, so a neural network is a composite classifier and can solve more complex problems,
- requires initialization of weights with random values to avoid symmetry,
- requires calculation of forward and back propagation (an expensive operation).
Below is an example of a neural network with 3 layers: 2 input nodes, a 3-node hidden layer, 2 nodes in the output layer, and 2 bias nodes.
The function calculating the output of a node is:
$$ h_{\theta}(x) = \frac{1}{1+e^{-{\theta}^Tx}}$$
Using θ^(j-1) (the matrix of weights controlling the function mapping from layer j-1 to layer j), the activation of node i in layer j is calculated from the outputs of the nodes m of the previous layer:
The cost function is minimized by iteratively improving the θ values. For Neural Networks it is also required to calculate the error terms. The following equations are used: for the last layer:
$$ \delta = h_{\theta}(x)-y $$
for layers 1...L-1 (where L is the number of network layers):
$$ \delta^{(l)} = ((\theta^{(l)})^T\delta^{(l+1)}) .* a^{(l)} .* (1-a^{(l)}) $$
and back propagation delta:
$$ \Delta_n = \sum_{i=1}^m\delta_n^i*a_{n-1}$$
and the derivative of the cost function (the gradient used for adaptation):
The regularization factor is removed for the first layer. The numerical approximation of the derivative (gradient checking) is very expensive, so it should be used only to confirm the gradients calculated by back propagation:
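A sketch of such a numerical gradient check on a simple quadratic cost; the cost function and ε are illustrative, and in practice the analytic gradient would come from back propagation:

```python
import numpy as np

def cost(theta):
    # illustrative cost function with a known analytic gradient
    return np.sum(theta ** 2) / 2

def analytic_grad(theta):
    return theta                      # would normally be produced by back propagation

def numeric_grad(theta, eps=1e-4):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        # two-sided approximation of the partial derivative
        grad[i] = (cost(theta + e) - cost(theta - e)) / (2 * eps)
    return grad

theta = np.array([1.0, -2.0, 0.5])
print(np.max(np.abs(analytic_grad(theta) - numeric_grad(theta))))  # should be ~0
```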
1.4 SVM (Support Vector Machine) - there are many methods to create an SVM; below only the more important ones:
1.4.1 No kernel ("linear kernel") - used when there are many features but not much training data. This algorithm is similar to logistic regression.
1.4.2 Polynomial kernel - used when there is a significant amount of training data.
1.4.3 Gaussian kernel - used when there are not many features but a significant amount of training data.
The cost function to minimize is:
$$ \min_\Theta C \sum_{i=1}^{m}\left[y^{(i)}cost_1(\Theta^Tf^{(i)})+(1-y^{(i)})cost_0(\Theta^Tf^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n}\Theta_j^2 $$
2. Unsupervised Learning - a group of algorithms looking for similarities in the data and aggregating it into a defined number of classes.
- if the number of classes is not forced, it should be chosen on the basis of the cost function of the trained algorithm (the elbow method).
2.1 K-means - K centroids are randomly initialized from the training set. Then each data point is assigned to the centroid for which the cost is lowest. Iteratively, each centroid is moved to the mean of its class to get the lowest cost within that class.
This kind of algorithm is used for partitioning data or assigning product dimensions to groups, e.g. dress sizes (S, M, L). The minimized cost function is (a sketch follows below it):
$$J(c^{(1)},...,c^{(m)}, \mu_1,...,\mu_K)= \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} - \mu_{c^{(i)}} \rVert ^2$$
where m is the number of training examples and K is the number of centroids (number of classes).
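A compact K-means sketch on synthetic 2-D data; K and the iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
K = 2

centroids = X[rng.choice(len(X), K, replace=False)]   # random init from the training set
for _ in range(10):
    # assign each point to the closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    c = np.argmin(dists, axis=1)
    # move each centroid to the mean of its class
    centroids = np.array([X[c == k].mean(axis=0) for k in range(K)])

cost = np.mean(np.linalg.norm(X - centroids[c], axis=1) ** 2)
print(centroids, cost)
```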
2.2 PCA (Principal Component Analysis) - dimension reduction used for data compression or to reduce data for visualization. The algorithm removes one or more dimensions from each example.
The covariance matrix is calculated as:
$$ \Sigma= \frac{1}{m}\sum_{i=1}^m (x^{(i)})(x^{(i)})^T $$
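A sketch of PCA reducing illustrative data from 3 dimensions to 2, using the covariance matrix above and SVD:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)                  # feature centering (mean normalization)

Sigma = (X.T @ X) / X.shape[0]          # covariance matrix, as in the formula above
U, S, _ = np.linalg.svd(Sigma)          # columns of U are the principal directions

k = 2                                   # target number of dimensions
Z = X @ U[:, :k]                        # projected (compressed) data
X_approx = Z @ U[:, :k].T               # approximate reconstruction
print(Z.shape, X_approx.shape)
```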
2.3 Anomaly detection - an algorithm used to detect anomalies in data. It could be replaced with supervised learning algorithms, but it is used when there is a huge amount of correct data and few or no cases showing anomalies. It is used, for example, to detect anomalies in engines, CPU load, etc.
It is based on the Gaussian distribution, so an anomaly is detected if P(x) < ε, where ε is a defined threshold.
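A sketch of Gaussian anomaly detection on a single illustrative feature; the threshold ε is an assumption and would normally be chosen on a cross-validation set:

```python
import numpy as np

rng = np.random.default_rng(7)
x_train = rng.normal(loc=50.0, scale=5.0, size=1000)   # mostly correct data, e.g. CPU load

mu = x_train.mean()
sigma2 = x_train.var()

def p(x):
    # Gaussian density with the estimated parameters
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

eps = 1e-4                       # assumed threshold
x_new = np.array([51.0, 95.0])
print(p(x_new) < eps)            # [False, True] -> the second value is flagged as an anomaly
```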
3. Other ideas
3.1 Recommender Systems - algorithms used by video streaming portals, social media and store portals to suggest other films, friends or products which could interest a customer. The problem could be solved by linear regression, but it is subjective how strongly something belongs to a given category and how much someone likes a specific characteristic of a product. Usually the system has only a little information about the customer, or does not know his preferences at all. That is why the collaborative filtering algorithm is used.
The goal is to minimize the cost function:
$$ J(x, \theta)= \frac{1}{2} \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(x_k^{(i)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2$$
where n_u is the number of customers, n_m is the number of products, r(i,j)=1 is a flag indicating that customer j rated product i, and y^(i,j) is the value of the customer's rating.
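A sketch evaluating this cost for illustrative ratings, feature and parameter matrices (all values are made up):

```python
import numpy as np

rng = np.random.default_rng(8)
n_m, n_u, n = 5, 4, 3                                     # products, customers, features
Y = rng.integers(1, 6, size=(n_m, n_u)).astype(float)     # ratings 1..5
R = (rng.random((n_m, n_u)) > 0.3).astype(float)          # 1 where a customer rated a product
X = rng.normal(size=(n_m, n))                             # product features (learned)
Theta = rng.normal(size=(n_u, n))                         # customer preferences (learned)
lam = 1.0

err = (X @ Theta.T - Y) * R               # errors counted only where ratings exist
J = 0.5 * np.sum(err ** 2) \
    + lam / 2 * np.sum(X ** 2) \
    + lam / 2 * np.sum(Theta ** 2)
print(J)
```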
3.2 Online learning - a system where there is no limit on input data. The algorithm is constantly learning and improving its predictions. This requires using a proper α.
Data can also be processed in parallel in batches. This can be achieved with the MapReduce algorithm.
Batch gradient descent:
$$ \theta_j=\theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} $$
where m is the number of records in the batch. Each partial sum is calculated in parallel and then combined into one update (a sketch follows below).
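A sketch of splitting this sum over a batch into chunks whose partial gradients are computed independently (the "map" step) and then combined (the "reduce" step); the chunking into 4 parts stands in for 4 workers and is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(9)
m = 400
X = np.column_stack([np.ones(m), rng.normal(size=m)])
y = 1.0 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=m)
theta = np.zeros(2)
alpha = 0.1

def partial_gradient(X_part, y_part, theta):
    # "map" step: the part of the gradient sum for one chunk of the batch
    return X_part.T @ (X_part @ theta - y_part)

for _ in range(200):
    chunks = zip(np.array_split(X, 4), np.array_split(y, 4))               # e.g. 4 workers
    total = sum(partial_gradient(Xc, yc, theta) for Xc, yc in chunks)      # "reduce" step
    theta -= alpha * total / m
print(theta)    # should approach [1.0, 3.0]
```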
This is the presented algorithms from Andrew's course in a nutshell. More tips and ideas will be presented in the next article.