This week, while I was off sick, I spent a little time exploring Apache's Big Data tools, e.g.:
- Apache Zeppelin
- Apache Hadoop
- Apache Spark.
Apache Zeppelin is a kind of notebook that lets you pull data from supported sources, e.g. Spark, Cassandra, or a JDBC data source such as PostgreSQL, and process it as tasks in notes. You only have to configure the connection, either in XML or through the GUI. Zeppelin supports several ways of presenting data, e.g. tables, grids and diagrams.
Hadoop and Spark are used for distributed data processing. Spark is the newer approach and works mainly in distributed memory rather than on distributed storage as Hadoop does. This allows Spark to be up to 100 times faster than Hadoop, because, as has been observed, about 90% of the processing time is spent reading from and writing to storage.
How does distributed processing work?
We start with a cluster of hundreds or thousands of servers. Say we have to compute one hundred arithmetic tasks. To be fast and fault tolerant, the cluster's master node splits the tasks between, e.g., 200 servers, so that each task is solved on two separate machines. If one of them fails, the other one still returns the solution. In the end the master node collects the solutions and returns them to the client.
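The scheme above can be imitated on a single machine. The following is only a minimal sketch in plain Python, not Hadoop or Spark code: the function names, the "add two numbers" task and the rule that task i goes to servers 2i and 2i+1 are all assumptions made for illustration, and the "servers" are run one after another instead of in parallel.

```python
def run_on_server(server_id, task, failed_servers):
    """Pretend to run one arithmetic task; a failed server returns nothing."""
    if server_id in failed_servers:
        return None
    a, b = task
    return a + b


def process(tasks, failed_servers=frozenset(), replicas=2):
    """Master node: send every task to `replicas` servers and collect results."""
    solutions = []
    for task_id, task in enumerate(tasks):
        # Hypothetical placement rule: task i runs on servers 2i and 2i+1.
        servers = [task_id * replicas + r for r in range(replicas)]
        answers = [run_on_server(s, task, failed_servers) for s in servers]
        survivors = [a for a in answers if a is not None]
        if not survivors:
            raise RuntimeError(f"all replicas of task {task_id} failed")
        # Any surviving replica gives the same answer, so take the first one.
        solutions.append(survivors[0])
    return solutions


# One hundred tasks spread over 200 servers, three of which are down.
tasks = [(i, i + 1) for i in range(100)]
solutions = process(tasks, failed_servers={0, 5, 198})
print(len(solutions), solutions[:3])
```

As long as the two servers holding the same task do not fail together, the client still gets all one hundred solutions.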
Hadoop and Spark process big data using the MapReduce programming model. MapReduce retrieves useful data from the huge volumes of data stored in Big Data resources such as NoSQL databases, files and other sources.
In my case I prepared a small input file and implemented:
- data producer - a class which retrieves the interesting data from each row of my file.
- mapReducer - a class which implements the aggregation algorithm for the data retrieved by the producer.
Then I only needed to run my classes with Hadoop.
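To show the idea behind these two pieces without a cluster, here is a rough sketch in plain Python. It does not use the Hadoop API and it is not the code I ran: the row layout (city, date, amount), the sample rows and the names `produce` and `map_reduce` are assumptions chosen only for this example.

```python
from collections import defaultdict


def produce(line):
    """'Data producer': pull the interesting (key, value) pair out of one row."""
    city, _date, amount = line.strip().split(",")
    return city, float(amount)


def map_reduce(lines):
    """'mapReducer': group the produced values by key, then sum each group."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank rows
        key, value = produce(line)
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input rows standing in for the small input file.
sample = [
    "Krakow,2016-01-04,10.5",
    "Warsaw,2016-01-04,3.0",
    "Krakow,2016-01-05,2.5",
]
totals = map_reduce(sample)
print(totals)
```

In a real Hadoop job the grouping step happens between machines, so the producer and the aggregation can run on different servers.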
If your impression after reading this article is that Hadoop and Spark are the same tool with minor differences, you are wrong. I am not an expert on the subject and I have only written about the main differences. From my limited experience, they are different tools with different modules, although some of those modules are probably shared.