In this article, I will describe the easiest ways to get started with Hadoop in a dockerized environment.
There are two images that are popular for spinning up a new Hadoop container. The easiest, as you might expect, comes from Cloudera, since it provides all of Hadoop's features and its ecosystem in one single box. Let's start with the Cloudera QuickStart image first:
First Approach (cloudera/quickstart)
Just run the following command to spin up a new container from the image named "cloudera/quickstart", exposing the most important ports (in this example only the Cloudera examples, the Hue interface, and Cloudera Manager are exposed) so they are accessible from the host machine.
docker run --hostname=quickstart.cloudera --privileged=true -d -t -i -p 8888:8888 -p 7180:7180 -p 80:80 -p 50070:50070 -v $(pwd):/home/cloudera -w /home/cloudera cloudera/quickstart /usr/bin/docker-quickstart
Most of the common Hadoop ports are listed below; if you need more, just add them in the same "-p host:container" format.
- 8888 exposes the Hue interface
- 7180 exposes Cloudera Manager
- 80 exposes the Cloudera examples
- 8983 exposes the Solr search web UI
- 50070 exposes the NameNode web UI
- 50090 exposes the Secondary NameNode
- 50075 exposes the DataNode
- 50030 exposes the JobTracker
- 50060 exposes the TaskTrackers
- 60010 exposes the HBase Master status
- 60030 exposes the HBase RegionServer
- 9095 exposes the HBase Thrift server
- 8020 exposes the HDFS port
- 8088 exposes the YARN ResourceManager web UI
- 4040 exposes the Spark UI
- 18088 exposes the Spark History Server web interface
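If you want to expose several of these ports, a small helper can build the -p flags for you. This is only a sketch: the ports chosen below are an example subset taken from the list above, and the command is echoed rather than executed so you can review it first.

```shell
# Build the -p flags for docker run from a space-separated port list.
# The ports chosen here (Hue, Cloudera Manager, examples, NameNode UI)
# are only an example; edit PORTS to match what you actually need.
PORTS="8888 7180 80 50070"
PORT_FLAGS=""
for p in $PORTS; do
  PORT_FLAGS="$PORT_FLAGS -p $p:$p"
done

# Print the final command instead of running it, so you can review it first.
echo docker run --hostname=quickstart.cloudera --privileged=true -d -t -i \
  $PORT_FLAGS cloudera/quickstart /usr/bin/docker-quickstart
```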
The command should print the container's hash (its container id). Grab just the first 3 characters of it and use them as shown below to attach to the container, in other words to get access inside it (the same concept as using SSH or a remote login to a machine), then start Cloudera Manager:
# docker attach 4f0
# sudo su
# cd /home/cloudera/
# ./cloudera-manager
Now you can magically browse the Cloudera Manager home page as if it were running on your localhost (http://localhost:7180). Make sure to start all services:
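Cloudera Manager can take a little while to come up after the container starts. A small poll loop like the one below saves you from refreshing the browser by hand; this is just a sketch, and the URL assumes the 7180 port mapping from the docker run command above.

```shell
# Poll until Cloudera Manager answers on its web port.
# The URL assumes you mapped port 7180 as in the docker run command above.
CM_URL="http://localhost:7180"

wait_for_cm() {
  # $1 = maximum number of attempts; waits 5 seconds between tries.
  attempts=$1
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -s -o /dev/null "$CM_URL"; then
      return 0  # got an HTTP response: Cloudera Manager is up
    fi
    i=$((i + 1))
    sleep 5
  done
  return 1  # gave up after $attempts tries
}
```

Call it as "wait_for_cm 30" to wait up to about two and a half minutes.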
Your Hadoop node is now ready for MapReduce jobs; you can check my next article on how to build and submit MapReduce jobs from IntelliJ IDEA.
Second Approach (sequenceiq/hadoop-docker)
It is pretty straightforward to use the image named sequenceiq/hadoop-docker; however, it only contains HDFS and MapReduce, with none of the Hadoop ecosystem installed (it is only the core Hadoop system, without other systems such as Hive, Pig, Flume, HBase, etc.)
Open your terminal (or PowerShell on Windows) and simply run the following command:
docker run -t -i sequenceiq/hadoop-docker /etc/bootstrap.sh -bash
You will get direct access inside the container. Type the following to make sure everything is fine:
hdfs dfs -ls /
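Beyond listing the root of HDFS, a quick end-to-end check is to run one of the MapReduce example jobs that ship with the image. The sketch below assumes the sequenceiq image's layout, where HADOOP_PREFIX points at the Hadoop install; the exact jar version varies between image builds, so it is located with a glob, and the /usr/local/hadoop fallback is only a guess.

```shell
# Run the bundled "pi" MapReduce example as an end-to-end sanity check.
# HADOOP_PREFIX is set by the sequenceiq image; /usr/local/hadoop is a
# guessed fallback in case the variable is not set.
run_pi_example() {
  prefix=${HADOOP_PREFIX:-/usr/local/hadoop}
  jar=$(ls "$prefix"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar 2>/dev/null | head -n 1)
  if [ -z "$jar" ]; then
    echo "examples jar not found under $prefix" >&2
    return 1
  fi
  # 2 maps with 10 samples each: small enough to finish quickly on one node.
  "$prefix"/bin/hadoop jar "$jar" pi 2 10
}
```

Run "run_pi_example" inside the container; a successful job prints an estimate of pi at the end.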