This post describes my first impressions after completing a basic Julia course.
I am an experienced Java developer, but I have also had contact with C/C++, Python and Octave. For me, Julia has something from all of those languages.
```julia
# for loops
for i in 1:10, j in 1:20
    println("Hi $i , $j")
end

for item in items
    println("Hi $item")
end

# function definition
function power(x)
    # the value of the last expression is returned
    x^2
end

# other ways to define a function
power(x) = x^2
power = x -> x^2

# non-mutating sort (returns a sorted copy)
sort(x)
# mutating sort (sorts in place)
sort!(x)
```
Benchmarks which I saw in the course [3] show that Julia's performance is similar to, or even slightly better than, C code, and about two orders of magnitude faster than Python.
In this article I describe Neo4j in a nutshell. It covers the following topics:
a short description of the database,
in which cases it is worth considering a graph database,
what the advantages are compared to a relational database,
a short description of the "graph SQL" - Cypher,
a few examples of queries in Cypher,
a short guide on how to run a Java project with Spring Boot and Spring Data dependencies.
Neo4j is a graph database. It is transactional and ACID compliant, with native graph storage and processing. It uses a "graph SQL" language called Cypher, dedicated to graph databases.
Graph databases can be used wherever there is a need to store graph-like dependencies between objects - so, I would say, in most cases I know.
Compared to relational databases, native storage and processing has the advantage that relationship-matching queries execute faster than the equivalent relational joins, whose cost grows rapidly with the depth of the relationship.
Take the simple case of customers using some services - a many-to-many relation.
In a relational database this requires a join table holding the ids of services and of the customers using them. To find all customers of a single service, it is required to find the service id, then find the customer ids in the join table, and then find the customers themselves in a third table.
In a graph database every service node (every object is a node - roughly the equivalent of a table row) stores direct relationships to its customer nodes. This requires more storage, but it is much faster than going through a join table. Other advantages are:
- an automatically extending schema, as in other NoSQL databases - when data is added for a node, a relationship, or a property on a node or relationship, the schema is extended automagically,
- a dedicated "graph SQL" called Cypher.
Cypher is a language dedicated to graph databases. Below I have placed a few examples.
```cypher
// simple select "from a table"
MATCH (c:Human) WHERE id(c) = 1 RETURN c;
// equivalent in SQL
SELECT * FROM Human WHERE id = 1;

// simple relation
MATCH (p:Person)-[r:ACTED_IN]->(m:Movie) WHERE p.name = 'Tom' RETURN p, r, m;
// equivalent in SQL
SELECT * FROM Person p
  JOIN Relation r ON p.id = r.person_id
  JOIN Movie m ON m.id = r.movie_id
WHERE p.name = 'Tom' AND r.type = 'ACTED_IN';
```
A more complicated case is when there is a need to follow a chain of relations, e.g. who is above an employee in the hierarchy. Isn't that a typical graph case? In Oracle PL/SQL there is the "CONNECT BY" query construction, but I honestly don't know how other databases handle it. In MySQL I once saw a recursive procedure storing each level in a temporary table. So how does it look in Cypher?
```cypher
// this returns the supervisors and their supervisors
MATCH path = (n:Person)-[r:REPORTS_TO*]->(s)
WHERE n.name = 'Tom'
RETURN s ORDER BY length(path)

// the opposite case: subordinates found by supervisor id
MATCH path = (n)-[r:REPORTS_TO*]->(s:Person)
WHERE id(s) = 12
RETURN n ORDER BY length(path)
```
```cypher
// delete the whole graph
MATCH (n) DETACH DELETE n

// create nodes and a data set
CREATE (:Car:Vehicle {type: "Van"})<-[:DRIVES]-(:Human {name: "Basia"})

// update data
MATCH (h:Human)-[d:DRIVES]->(c:Car:Vehicle)
WHERE c.type = "Van"
SET h.name = "Ula", c.productionYear = "1982"
RETURN h, d, c

// constraint - unique field
CREATE CONSTRAINT ON (h:Human) ASSERT h.name IS UNIQUE

// delete data matching the query
MATCH (c:Car:Vehicle)<-[d:DRIVES]-(h:Human)
DELETE c, d, h
```
Spring Boot project.
In Spring Boot with Spring Data it is enough to add the spring-boot-starter-data-neo4j artifact, set the Neo4j properties under spring.data.neo4j.*, and add @EnableNeo4jRepositories to the configuration - then it is possible to create node entities.
In the background this adds dependencies on the org.neo4j:neo4j-ogm-* artifacts, spring-data-neo4j, and the org.neo4j.driver artifact.
I was working with:
- a Docker image of Neo4j 4.1.1, without auth,
- JDK 11,
- Spring Boot 2.2.4.
This article is a quick overview of the Docker tools that are commonly used in a micro-service architecture:
Docker
Docker Compose
Docker is a platform for running applications in containers. Containers are created from images, which are built incrementally, layer after layer - similar to commits in a code repository.
A container is an environment for running an isolated application. Unlike a virtual machine, it doesn't use its own operating system - it shares it with the host. That's why a container starts up in seconds and is much lighter on the physical machine, instead of taking minutes to boot like a virtual machine. That's also why Docker is commonly used to spin up application instances.
Docker can be used interactively from the console. Below are the most useful commands:
- docker ps - list running containers,
- docker images - show images in the local repository,
- docker run -d [image_name] - run an image in detached (daemon) mode,
- docker exec -it [container_id] "[command to run in the container, e.g. /bin/sh]" - attach to a specific container and execute a command in it,
- docker container logs [container_id] - print logs from a container,
- docker pull [image_name] - pull an image from an external image repository.
But the biggest benefit of Docker is that it can be driven by scripts, so the whole process is repeatable and can be automated. The default build file is named Dockerfile. Below is an example:
```dockerfile
# base image for this build
FROM openjdk:8-jdk-alpine

# define which directory should be mounted as a volume
# - mounted directories are created under /var/lib/docker/volumes
VOLUME /tmp

# only documents which ports the application can expose
EXPOSE 8080

# define a build-time variable
ARG JAR_FILE=target/*.jar

# define a variable from an environment variable, or "v1.0.0" if not defined.
# ENV overrides an ARG variable. Example build with a variable:
# $ docker build --build-arg CONT_IMG_VER=v2.0.1 .
ENV SOME_ENV_VAR ${CONT_IMG_VER:-v1.0.0}

# copy a file from the host into the image
COPY ${JAR_FILE} app.jar

# also copies a file from the host into the image, but compared to COPY
# it can also fetch a file from a URL and extract tar archives
ADD ${JAR_FILE} app.jar

# run a command inside the image during the build
RUN uname -a

# health check command - docker checks whether the application works properly
HEALTHCHECK --interval=5m --timeout=3s --retries=5 \
  CMD curl -f http://localhost/ || exit 1

# run the application as the goal of this image
ENTRYPOINT ["java", "-Djava.security.egd=file:/dev/./urandom", "-jar", "/app.jar"]
```
Having a Dockerfile, the image is built with the command:
docker build
and then, if the image has to be pushed to a remote repository:
docker push
Docker Compose is a tool to stand up several containers based on a docker-compose.yml file. The tool manages the dependencies between containers, so with one command it is possible to run many services (containers). Below are a few of the most useful commands:
- docker-compose build - build the images defined in docker-compose.yml,
- docker-compose up -d - run the containers in daemon mode,
- docker-compose down - stop the containers,
- docker-compose logs - print logs from the containers.
```yaml
# version of the file format
version: "3.3"
# definition of services (container templates)
services:
  # name of the service
  mongoDB:
    # image name - this image is pulled from a remote repository
    image: library/mongo:4.4.0
    # container name
    container_name: "mongoDBcontainerName"
    # what to do if the application dies
    restart: on-failure
    # ports exposed to the host (host port : container port)
    ports:
      - 27017:27017
    # images define variables; this is how their values are set
    environment:
      MONGO_INITDB_ROOT_USERNAME: sboot
      MONGO_INITDB_ROOT_PASSWORD: example
      MONGO_INITDB_DATABASE: test
    # storage mapping (host path : container path : access mode)
    volumes:
      - ./src/main/sql/mongo-init.js:/docker-entrypoint-initdb.d/mongo-init.js:ro
  app:
    # build properties - this service will be built
    build:
      # build context path on the host
      context: ./
      # docker file
      dockerfile: Dockerfile
    container_name: "myApp"
    # services this one depends on
    depends_on:
      - mongoDB
    # defines DNS names inside the container for the dependent services
    links:
      - mongoDB
```
To prepare this article I used:
Docker in version 19.03.6 - provided by the system,
Docker Compose in version 1.17.1 - provided by the system.
This article is a continuation of the Machine Learning series. I present a few pieces of advice given by Andrew Ng in his Coursera course. They are useful when building a Machine Learning System (MLS). This article covers:
how to prepare data,
how to debug the system,
what skewed classes are,
how to carry out ceiling analysis.
Preparing the data set:
On a small data set (up to ~10-100 000 records) it is recommended to split the randomized data set in the following proportions:
60% - training records - used to train the algorithm, i.e. to find the θ parameters giving the lowest cost.
20% - cross-validation records - used to select the best configuration of the algorithm, e.g. for a Neural Network (NN) to check how many layers the network should have, or to remove useless features.
20% - test records - used to measure the performance of the MLS.
For a large data set (above 100 000 records) it is recommended to change the proportions to 92%/4%/4% respectively.
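As an illustration, here is a minimal NumPy sketch of such a randomized 60/20/20 split (my own example, not course code; X and y are assumed to be arrays of features and labels):

```python
import numpy as np

def split_dataset(X, y, seed=42):
    """Shuffle the data, then split it 60% / 20% / 20% into train / cross-validation / test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_cv = int(0.6 * len(X)), int(0.2 * len(X))
    train, cv, test = np.split(idx, [n_train, n_train + n_cv])
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])

X, y = np.random.randn(1000, 5), np.random.randint(0, 2, 1000)
(train_X, train_y), (cv_X, cv_y), (test_X, test_y) = split_dataset(X, y)
print(len(train_X), len(cv_X), len(test_X))   # 600 200 200
```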
Debugging the MLS:
To improve the MLS it is good to perform error analysis and then consider:
- using more training examples,
- changing the set of features (fewer/more/different ones),
- adding polynomial features,
- changing the λ value in the regularization term,
- changing the number of nodes or layers (for NN).
Size of the training set - below I added a chart showing the dependency between the cost function and the number of records used from the training set (the learning curve).
On the left chart it can be noticed that in the high-bias case adding more data does not decrease the high error. However, when the function is complicated (high variance), a large gap between the training and cross-validation errors can be observed, and the cross-validation error slowly decreases as more data is added.
This can be manipulated by changing the set of features (fewer/more). Below I added a chart of the dependency between the cost (error) and the complexity of the fitted function, together with example functions for one data set.
How exactly is this done? First, the function is trained on the training data, and then the cross-validation error is calculated for a few configurations of features.
When high bias is observed, it can mean that the fitted function is too simple for the prepared data set. It may be necessary to add new features or to create polynomial features from the existing ones.
When high variance is observed, it can mean that the fitted function is too complex. It may be necessary to remove some features.
It is also possible to manipulate bias and variance by changing the λ of the regularization term. Below I added 3 charts: for a very big λ, a "just right" λ, and a λ close or equal to 0.
It can be noticed that a too big λ creates an almost constant function. When λ is close to 0, the regularization term is negligibly small and can be skipped.
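To make the λ selection concrete, here is a small self-contained sketch (my own example, not the course code): a regularized polynomial regression is trained with several λ values, and the λ with the lowest cross-validation error is picked.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form regularized linear regression; the bias column is not regularized."""
    reg = lam * np.eye(X.shape[1])
    reg[0, 0] = 0
    return np.linalg.solve(X.T @ X + reg, X.T @ y)

def cost(theta, X, y):
    """Mean squared error / 2, without regularization - used for evaluation."""
    residual = X @ theta - y
    return residual @ residual / (2 * len(y))

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = x**3 - 2 * x + rng.normal(0, 3, 200)          # noisy cubic data
X = np.column_stack([x**p for p in range(9)])     # polynomial features up to degree 8
X_train, y_train, X_cv, y_cv = X[:120], y[:120], X[120:160], y[120:160]

lambdas = [0, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
cv_errors = {lam: cost(fit_ridge(X_train, y_train, lam), X_cv, y_cv) for lam in lambdas}
print("lambda with the lowest cross-validation error:", min(cv_errors, key=cv_errors.get))
```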
Skewed classes
This term refers to a situation where the data set of one category is much larger than that of the other category, e.g. for a binary output, when 99% of the examples belong to the "true" category and 1% to the "false" category. Then there is not much difference in raw accuracy between a trained logistic regression algorithm and a trivial system that always returns "true" - at most 1 percentage point - which doesn't sound bad, even though the two systems are fundamentally different.
That's why, to compare systems like these, the following terms are defined:
- true positive,
- true negative,
- false negative,
- false positive
described in the drawing below:
and measures:
- precision - the ratio of true positives to all positive predictions (true positives plus false positives):
$$ precision = \frac{TP}{TP + FP} $$
- recall - the ratio of true positives to all actual positives (true positives plus false negatives):
$$recall = \frac{TP}{TP+FN} $$
Combining precision (P) and recall (R) gives a single measure, the F1 score:
$$ F_1 = 2\cdot\frac{P \cdot R}{P+R} $$
so a bigger score means a better system.
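A small sketch (my own example) showing how these measures can be computed from predictions:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 score for binary labels, where 1 marks the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# a skewed example: 9 negatives and 1 positive, with one false positive
print(precision_recall_f1([0]*9 + [1], [0]*8 + [1, 1]))   # (0.5, 1.0, 0.666...)
```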
The last term in this article is ceiling analysis - this is more of an economic term, because it looks at the whole system as a set of MLS modules working in a pipeline.
This analysis answers the question of which module should be improved to get higher accuracy of the whole application.
1. Supervised Learning
1.1 Linear regression - the algorithm adapts the parameters of an equation to approximate the training data and get the lowest cost.
In the course, two methods to achieve that were presented (a small sketch of both follows below):
1.1.1 gradient descent - an iterative approach - in each iteration the cost function should get closer to a local minimum. The main requirements and properties:
- it needs a chosen learning rate α - if too big, the cost may increase; if too low, the number of steps needed to reach the minimum of the cost function increases,
- it needs many iterations,
- it is recommended for a large number of features,
1.1.2 normal equation - a non-iterative way to find θ. The main features of this algorithm:
- no α factor,
- it doesn't iterate to find the minimum of the cost function,
- it requires calculating (X^T X)^(-1), which has complexity O(n^3), so it is slow for a large number of features,
- it can run into problems with matrix inversion, which requires some additional handling (removing redundant features or using regularization).
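A minimal NumPy sketch of both methods on generated data (my own example):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.uniform(0, 10, 100)])   # bias column + one feature
y = X @ np.array([4.0, 2.5]) + rng.normal(0, 1, 100)           # y ~ 4 + 2.5x + noise

# 1.1.1 gradient descent
alpha, theta = 0.01, np.zeros(2)
for _ in range(5000):
    gradient = X.T @ (X @ theta - y) / len(y)
    theta -= alpha * gradient

# 1.1.2 normal equation: theta = (X^T X)^(-1) X^T y; pinv also copes with a non-invertible X^T X
theta_normal = np.linalg.pinv(X.T @ X) @ X.T @ y

print(theta, theta_normal)   # both should be close to [4, 2.5]
```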
1.2 Logistic regression - a classification algorithm that gives a binary output.
For more than 2 classes (n classes), an algorithm with n functions (one-vs-all) is used, and to get the most probable class, the function with the highest output probability is chosen. Issues encountered:
- choosing the correct decision boundary,
- additional optimization algorithms (conjugate gradient, BFGS, L-BFGS) - usually faster but more complex.
The goal is to minimize the cost function. For multi-class classification the prediction is the class whose hypothesis h returns the highest value.
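A minimal sketch (my own example) of the hypothesis and the one-vs-all prediction rule:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def h(theta, X):
    """Logistic regression hypothesis: the probability of the positive class."""
    return sigmoid(X @ theta)

def predict_one_vs_all(thetas, X):
    """thetas holds one parameter vector per class; pick the class with the highest probability."""
    probabilities = np.column_stack([h(theta, X) for theta in thetas])
    return np.argmax(probabilities, axis=1)

X = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, 3.0]])        # bias column + one feature
thetas = [np.array([2.0, -1.5]), np.array([-3.0, 1.0])]    # two one-vs-all classifiers
print(predict_one_vs_all(thetas, X))                       # [0 0 1]
```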
1.3 Neural networks - a classification algorithm consisting of layers of nodes, loosely reflecting the human brain:
- a single neural network layer is essentially logistic regression, so a neural network is a composite classifier and can solve more complex problems,
- it requires initializing the weights with random values to avoid symmetry,
- it requires calculating forward and back propagation (an expensive operation).
Here is an example of a neural network with 3 layers - 2 input nodes, a hidden layer with 3 nodes, 2 nodes in the output layer, and 2 bias nodes.
The function calculating the output of a node is:
$$ h_{\theta}(x) = \frac{1}{1+e^{-{\theta}^Tx}}$$
Using θ^(j-1) (the matrix of weights controlling the mapping from layer j-1 to layer j), the activation of node i in layer j is calculated from the outputs of the nodes m of the previous layer:
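In that notation (my reconstruction of the course formula, with the sigmoid g from the equation above and the bias unit a_0 = 1) the activation is:
$$ a_i^{(j)} = g\left(\sum_{m}\theta_{im}^{(j-1)}\,a_m^{(j-1)}\right) $$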
The cost function is minimized by iteratively improving the θ values. For neural networks it is required to calculate the error terms. The following equations calculate them - for the last layer:
$$ \delta = h_{\theta}(x)-y $$
for the hidden layers 2...L-1 (where L is the number of network layers):
$$ \delta^{(l)} = (\theta^{(l)})^T\delta^{(l+1)} .* a^{(l)} .* (1-a^{(l)}) $$
and back propagation delta:
$$ \Delta_n = \sum_{i=1}^m\delta_n^i*a_{n-1}$$
and the derivative of the cost function (the gradient used to adapt θ), where the regularization term is omitted for the bias units.
The numerical approximation of the derivative is very expensive to compute, so it should be used only to confirm that the back-propagation gradient is correct:
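For reference, my reconstruction of these two course formulas - the regularized gradient built from Δ (with the unregularized bias column j = 0), and the numerical approximation used for gradient checking:
$$ D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)} + \lambda\,\theta_{ij}^{(l)} \quad (j \neq 0), \qquad D_{i0}^{(l)} = \frac{1}{m}\Delta_{i0}^{(l)} $$
$$ \frac{\partial}{\partial\theta_j}J(\theta) \approx \frac{J(\theta_1,\ldots,\theta_j+\epsilon,\ldots,\theta_n)-J(\theta_1,\ldots,\theta_j-\epsilon,\ldots,\theta_n)}{2\epsilon} $$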
1.4 SVM (Support Vector Machine) - there are many ways to build an SVM; below are only the more important ones (a small scikit-learn sketch follows after the cost function):
1.4.1 no kernel ("linear kernel") - used when there are many features but not much training data; this variant is similar to logistic regression,
1.4.2 polynomial kernel - used when there is a significant amount of training data,
1.4.3 Gaussian kernel - used when there are not many features but a significant amount of training data.
$$ \min_{\Theta} C \sum_{i=1}^{m}\left[y^{(i)}\,cost_1(\Theta^Tf^{(i)})+(1-y^{(i)})\,cost_0(\Theta^Tf^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n}\Theta_j^2 $$
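In practice these three variants map onto scikit-learn's SVM classes; a small sketch, assuming scikit-learn is installed (not part of the course material):

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

linear = LinearSVC(C=1.0, max_iter=10000).fit(X, y)            # 1.4.1 no kernel ("linear kernel")
poly = SVC(kernel="poly", degree=3, C=1.0).fit(X, y)           # 1.4.2 polynomial kernel
gaussian = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)   # 1.4.3 Gaussian (RBF) kernel

print(linear.score(X, y), poly.score(X, y), gaussian.score(X, y))
```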
2. Unsupervised Learning - a group of algorithms that look for similarities in the data and aggregate it into a defined number of classes.
- if the number of classes is not imposed, it should be chosen based on the cost function of the trained algorithm (the elbow method).
2.1 K-means - K centroids are randomly initialized from the training set. Then each data point is assigned to the centroid for which the cost is lowest. Iteratively, the mean of each class is moved to get the lowest cost within each class.
This kind of algorithm is used for partitioning data or for assigning product dimensions to groups, e.g. sizes of dresses (S, M, L). The minimized cost function is:
$$J(c^{(1)},...,c^{(m)}, \mu_1,...,\mu_K)= \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} - \mu_{c^{(i)}} \rVert ^2$$
where m is the number of training examples and K is the number of centroids (the number of classes).
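A minimal NumPy sketch of the K-means loop described above (my own example; an empty cluster simply keeps its previous centroid):

```python
import numpy as np

def kmeans(X, K, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]     # K centroids picked from the data
    for _ in range(iterations):
        # assignment step: index of the closest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # update step: move each centroid to the mean of its assigned points
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
                              for k in range(K)])
    return labels, centroids

X = np.vstack([np.random.randn(50, 2) + offset for offset in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, K=3)
print(centroids)
```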
2.2 PCA (Principal Component Analysis) - a dimensionality-reduction method used for data compression or to reduce data for visualization. The algorithm removes one or more dimensions from each example.
The covariance matrix is calculated as:
$$ \Sigma= \frac{1}{m}\sum_{i=1}^m (x^{(i)})(x^{(i)})^T $$
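A minimal sketch of PCA built from that covariance matrix via SVD (my own example; the data is assumed to be mean-normalized first):

```python
import numpy as np

def pca(X, k):
    """Project mean-normalized data X (m examples x n features) onto the first k principal components."""
    Sigma = (X.T @ X) / len(X)            # covariance matrix (n x n)
    U, S, _ = np.linalg.svd(Sigma)        # columns of U are the principal directions
    U_reduce = U[:, :k]
    Z = X @ U_reduce                      # reduced representation (m x k)
    X_approx = Z @ U_reduce.T             # approximate reconstruction back to n dimensions
    return Z, X_approx

X = np.random.randn(100, 5)
X = X - X.mean(axis=0)                    # mean normalization
Z, X_approx = pca(X, k=2)
print(Z.shape, X_approx.shape)            # (100, 2) (100, 5)
```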
2.3 Anomaly detection - an algorithm used to detect anomalies in data. It could be replaced with supervised learning algorithms, but it is used when there is a huge amount of correct data and few or no cases showing anomalies. It is used to detect anomalies in engines, CPU load, etc.
It is based on the Gaussian distribution, so an anomaly is detected if P(x) < ε, where ε is a defined threshold.
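A minimal sketch of that idea with independent Gaussian features (my own example; the feature meanings and the ε value are made up for illustration):

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the mean and variance of every feature from (assumed normal) training data."""
    return X.mean(axis=0), X.var(axis=0)

def probability(x, mu, var):
    """P(x) as the product of per-feature Gaussian densities."""
    densities = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.prod(densities)

X_train = np.random.normal(loc=[50.0, 1.0], scale=[5.0, 0.1], size=(1000, 2))  # e.g. CPU load, latency
mu, var = fit_gaussian(X_train)

epsilon = 1e-4                                     # threshold, normally chosen on a labeled CV set
x_new = np.array([85.0, 1.7])
print("anomaly" if probability(x_new, mu, var) < epsilon else "ok")
```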
3. Other ideas
3.1 Recommender Systems - algorithms used by video streaming portals, social media and online stores to suggest other films, friends or products that may be interesting for the customer. The problem could be solved by linear regression, but how deeply something belongs to a category, or how much someone likes a specific characteristic of a product, is subjective. Usually the system has only a little information about the customer, or doesn't know their preferences at all. That's why the collaborative filtering algorithm is used.
The goal is to minimize the cost function:
$$ J(x, \theta)= \frac{1}{2} \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(x_k^{(i)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2$$
where n_u is the number of customers, n_m the number of products, r(i,j)=1 a flag indicating that a customer rated a product, and y(i,j) the value of the customer's rating.
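A vectorized sketch of that cost function (my own example; X holds product features, Theta holds customer parameters, R marks which products were rated):

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Collaborative filtering cost: X (products x features), Theta (customers x features),
    Y (products x customers) ratings, R (products x customers) 0/1 mask of rated entries."""
    error = (X @ Theta.T - Y) * R                  # only rated entries contribute
    return (np.sum(error ** 2) + lam * np.sum(X ** 2) + lam * np.sum(Theta ** 2)) / 2

n_products, n_customers, n_features = 5, 4, 3
Y = np.random.randint(1, 6, size=(n_products, n_customers)).astype(float)
R = (np.random.rand(n_products, n_customers) > 0.3).astype(float)
X = np.random.randn(n_products, n_features)
Theta = np.random.randn(n_customers, n_features)
print(cofi_cost(X, Theta, Y, R, lam=1.0))
```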
3.2 Online learning - a system where there is no limit on the input data. The algorithm is constantly learning and improving its predictions. This requires using a proper learning rate α.
Data can also be processed in parallel in batches. This can be achieved with the MapReduce approach.
Batch gradient descent:
$$ \theta_j=\theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} $$
where m is the number of examples in the batch. Each partial sum is calculated in parallel, and the results are then combined into one update.
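A minimal sketch of that map-reduce style split using Python's multiprocessing (my own example, for one step of linear regression):

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """Map step: the gradient contribution of one chunk of the batch."""
    X_chunk, y_chunk, theta = args
    return X_chunk.T @ (X_chunk @ theta - y_chunk)

def parallel_gradient_step(X, y, theta, alpha, workers=4):
    chunks = list(zip(np.array_split(X, workers), np.array_split(y, workers), [theta] * workers))
    with Pool(workers) as pool:
        partials = pool.map(partial_gradient, chunks)
    gradient = sum(partials) / len(y)              # reduce step: combine the partial sums
    return theta - alpha * gradient

if __name__ == "__main__":
    X = np.column_stack([np.ones(1000), np.random.randn(1000)])
    y = X @ np.array([1.0, 2.0])
    print(parallel_gradient_step(X, y, np.zeros(2), alpha=0.1))
```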
This is a nutshell summary of the algorithms presented in Andrew's course. I will present more tips and ideas in the next article.
I had always had an aversion to Python, but I finally forced myself to try it. I am developing my skills in the machine learning area, and most examples and algorithm sources are written in Python - life forced me.
In this post I describe, in a nutshell, the basic and most important topics of the language, such as:
- why it was created,
- who uses it and why,
- a very short characterization of the language and the biggest differences I noticed.
Python was created in the late 1980s by Guido van Rossum as an interpreted, interactive, object-oriented, high-level language. The language is named after the Monty Python's Flying Circus TV comedy series. It was designed to be readable and easy to run in an academic environment. It uses dynamic typing, validated at runtime. It supports functional programming, and it is possible to compile Python code to bytecode, which is usually done in bigger applications.
It is worth adding that until this year (2020) there were two major lines: 2.x and 3.x. Since 2008, when version 3 was introduced, the older line has still been maintained. The versions are incompatible with each other, so when you learn Python pay attention to which version you use.
As I write this post the current version is 3.8.3, but I trained on 3.6.9 and I used only a few features of the language.
The creator cared about the interactive console, so each command can be typed ad hoc and executed. That is probably why this language won in the scientific community, where code doesn't need to be compiled before it is executed.
Currently in Python we can find a lot of tools and libraries to load data, process it and present it (plotting libraries, etc.). Dynamic typing is useful when we experiment with code, but from my point of view it can be dangerous at runtime, when we can hit a type incompatibility.
How does Python stand out from other languages? The first difference is that Python enforces code layout. It doesn't use semicolons and braces, so it relies on lines and indentation. This is what doesn't convince me about Python. Of course I like nicely formatted code, but I am used to braces and I don't believe we can live without them.
Another difference is that the language uses the words "not", "and" and "or" in conditional statements.
Python has wide support for lists. A programmer can concatenate lists, repeat their elements, search them, and filter them with just two or three extra characters.
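A few one-liners showing what I mean:

```python
numbers = [1, 2, 3]

print(numbers + [4, 5])                 # concatenation: [1, 2, 3, 4, 5]
print(numbers * 2)                      # repetition: [1, 2, 3, 1, 2, 3]
print(2 in numbers)                     # searching: True
print(numbers[::-1])                    # slicing, here a reversed copy: [3, 2, 1]
print([n for n in numbers if n > 1])    # filtering with a list comprehension: [2, 3]
```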
Python also supports decorators - functions that wrap other functions, as in the example below:

```python
def decorator(annotatedText):
    # definition of the decorator
    def text_generator(old_function):
        def new_function(*args, **kwds):
            return annotatedText + ' ' + old_function(*args, **kwds)
        return new_function
    return text_generator  # it returns the new generator

# Usage
@decorator('prefix')  # text attached before the function result
def return_text(text):
    return text

# Now return_text is decorated and reassigned to itself
print(return_text('myText'))  # 'prefix myText'
```
In this article I will describe what Elm is, how to start an adventure with it, and some details about the language. I am still exploring it, so please forgive me any mistakes.
1. What is Elm lang?
Elm is a statically and strongly typed functional language compiled to JavaScript. The code structure is similar to Python - Elm doesn't use braces but requires indentation. Strong typing protects the developer from most technical errors and from unknown application states. All technical errors are caught at compile time, and the developer is informed about them with detailed messages which usually contain a suggestion on how to fix them.
2. What tools does Elm include?
The elm command supports development and module upgrading. The most useful commands are:
elm init - initializes the project structure: it creates the src directory and the elm.json file,
elm repl - starts an interactive programming session,
elm reactor - runs a local web server to preview the project,
elm make - compiles code to JavaScript,
elm install - fetches packages,
and less popular:
elm-test init - creates a tests directory with example sources and updates the test dependencies in elm.json,
elm bump - updates the version of your package based on the changes in its API,
elm diff - shows the changes between versions of a package,
elm publish - publishes your code in the Elm package repository.
For more details please check the documentation.
3. How to start?
Using npm, the Elm tools can be installed with a few commands:
npm install elm
npm install elm-format
npm install elm-test
and then, to initialize the first project:
elm init
and to initialize tests for this project:
elm-test init
In the project directory the following directories and files are created:
src - the directory where production sources should be stored,
tests - the directory with test sources,
elm.json - a file with the project description and dependencies.
When the project is initialized, we can create the first Elm application.
I use IntelliJ with the Elm plug-in; however, it is possible to create an Elm source file in any text editor and save it with the .elm extension.
When the Elm file is created, it should be compiled to JavaScript. This is done with the command:
elm make src/Main.elm
By default a file "index.html" is created with the JavaScript included.
4. Architecture
In a nutshell, about the architecture of Elm.
Elm uses the Model-View-Update pattern. To update the view, a virtual DOM tree is used: each update operation creates a new copy of the virtual DOM tree, the new copy is compared with the previous one, and then all the changes are finally applied to the real DOM tree in one big batch. This solution greatly speeds up changes to the real DOM tree.
All values in Elm are immutable by design. The language offers:
- simple types:
Bool
Int
Float
Char
String
- complex types:
a typed List is a linked list, which simplifies operations on it. A List can be created by collecting elements of one type in square brackets
[elem, elem, elem]
or new elements can be prepended by calling
elem :: [elem]
an Array is also typed, like a List, and can be created from a List. An Array allows direct access to each element,
a tuple is a set of elements of different types and is typed as well. A tuple is created by collecting elements in round brackets
( elemA, elemB, elemC )
a record is a data structure. A record is created by collecting field names and values in braces
var1 = { field1 = elemA, field2 = elemB }
or
var1 = RecordType elemA elemB
where the record's values must be in the same order as in the definition
type alias RecordType = { field1 : String, field2 : Int}
Maybe is a wrapper over a value used to avoid null pointers. It has two variants: Just with a value, or Nothing.
- custom types - created by the developer
type UserStatus = Regular | Visitor
- special types:
"_" has a special meaning: it matches any value. It can be used as the default branch in a case construction or as an unused input of a function,
the unit type "()" - represents an empty value,
an inline (anonymous) function requires "\" before its declaration
\elem -> elem + 1
the result of a function can be piped to the function on the left with "<|" or to the right with "|>".
Let / if / case constructions:
let
    definitions
in
    function body

case variable of
    case_element -> body handling this case
    _ -> body of the default handling

if condition then
    body
else
    body
Modules:
When the application grows bigger and bigger, it is necessary to split the code into separate files. Elm treats each separate file as a module. Each module can contain private or public elements, which is declared in the header of the Elm file.
It is also possible to mix those approaches, e.g.
import Module as M exposing (exposing_fun1,exposing_fun2)
If there is a need to move modules into subfolders, the module name is preceded by the folder path separated with dots (similar to Java packages), e.g.
module Folder1.Folder2.ModuleName exposing (..)
To compile the application, it is only necessary to indicate its main module.
source: https://elmprogramming.com/
Ports
Elm can run isolated from the surrounding world, or it can communicate with it. When Elm needs to communicate with JavaScript, it is required to add the port specifier:
```elm
port module MainTable exposing (..)
```
When the communication goes from Elm to JavaScript, it is only required to define a port function in Elm and a callback function in JavaScript:
```
-- ELM --
port sendData : String -> Cmd msg

-- JavaScript --
app.ports.sendData.subscribe(function (data) {
    alert("Data from Elm: " + data);
});
```
In the other direction, it is required to define the subscriptions parameter in Browser.element and a handling port function in Elm, and then to call the port function from the JavaScript code.
Similar to other languages, the "main" function is defined as the entry point of the application. If a module has no main function and the compilation output is anything other than /dev/null, an error is thrown.
6. Test
At the beginning it is required to install elm-test:
npm install elm-test
which modifies the file "elm.json".
The test module provides developers with a few tools:
Test - test definitions,
Expect - a set of assertions,
Fuzz - a tool that generates random data and runs the test for each generated value.