Hadoop Interview Questions
Big Data and Hadoop are among the hottest skills to have today. More and more organizations are adopting Hadoop as one of their primary data platforms. Today there are roles like Hadoop Developer, Hadoop Administrator, Data Engineer, and Data Specialist, which pay quite well by industry standards. So are you looking for a career in Hadoop? Are you preparing for a Hadoop interview? Today I list the top Hadoop interview questions that are a must to prepare if you are going for a Hadoop interview. So good luck, and let us look at these Hadoop interview questions.
Basic Hadoop Interview questions
Below are some basic interview questions that are generally asked in any Hadoop interview if you are going for an entry-level position. Make sure that you have practiced these well.
|Q1||What is Hadoop?|
|Ans:||Hadoop is an open-source software framework that provides infrastructure and software tools / applications for processing large volumes of data. It provides distributed storage and distributed processing of data, and uses the MapReduce programming model.|
|Q2||What are the various components of Hadoop?|
|Ans:||The various components of Hadoop are:|
Pig and Hive – for data access
HBase – for data storage
Ambari, Oozie and ZooKeeper – for data management and monitoring
Thrift and Avro – for data serialization
Apache Flume, Sqoop, Chukwa – for data integration
Apache Mahout and Drill – for data intelligence
|Q3||Is Hadoop Free?|
|Ans:||Hadoop is an open-source framework hosted within the Apache Software Foundation and is free to use. However, there are vendor-specific distributions as well.|
|Q4:||Name some vendor specific distribution of Hadoop|
|Ans:||Cloudera, MapR, Microsoft Azure HDInsight, and IBM InfoSphere BigInsights are some examples of vendor-specific distributions of Hadoop.|
|Q5:||What are the various Hadoop Configuration files?|
|Ans:||The main configuration files are core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml, along with the masters and slaves files.|
|Q6||What is Hadoop Map Reduce?|
|Ans||MapReduce is a framework used to process large volumes of data in parallel across a Hadoop cluster. It is a two-step process – Map and Reduce – and hence it is called MapReduce.|
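The two-step process can be illustrated with a small word-count sketch. This is plain Python simulating the map, shuffle, and reduce phases conceptually; it is not the actual Hadoop API, where you would instead subclass Mapper and Reducer in Java.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 2
```

In a real cluster the map tasks run in parallel on different nodes, and the framework performs the shuffle over the network before the reduce tasks start.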
|Q7||What are the three modes in which Hadoop can run?|
|Ans||Standalone mode: the default mode. It uses the local file system and a single Java process to run the Hadoop services.|
Pseudo-distributed mode: This uses a single-node Hadoop deployment to execute all Hadoop services.
Fully-distributed mode: This uses separate nodes to run Hadoop master and slave services.
|Q9||What are the 5 V’s of the Hadoop framework?|
|Ans||The 5 V’s are : Volume, Velocity, Variety, Veracity and Value|
|Q10||What are the most common input formats in Hadoop?|
|Ans||The 3 most common input formats are:|
Text input format (TextInputFormat, the default)
Sequence file input format (SequenceFileInputFormat)
Key-value input format (KeyValueTextInputFormat)
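The difference between the text and key-value formats is in what they hand to the mapper. The sketch below mimics this behavior in plain Python (it is a conceptual illustration, not the Hadoop classes themselves): the text format keys each line by its byte offset, while the key-value format splits each line at the first tab.

```python
def text_input_format(data):
    # TextInputFormat-style: key = byte offset of the line, value = line text
    pairs, offset = [], 0
    for line in data.splitlines(keepends=True):
        pairs.append((offset, line.rstrip("\n")))
        offset += len(line)
    return pairs

def key_value_input_format(data, sep="\t"):
    # KeyValueTextInputFormat-style: split each line at the first separator
    pairs = []
    for line in data.splitlines():
        key, _, value = line.partition(sep)
        pairs.append((key, value))
    return pairs

data = "a\tapple\nb\tbanana\n"
print(text_input_format(data))       # keys are byte offsets: 0, then 8
print(key_value_input_format(data))  # keys are 'a' and 'b'
```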
|Q11||What is YARN?|
|Ans||YARN stands for Yet Another Resource Negotiator.|
It is Hadoop’s resource management and job scheduling framework: it manages cluster resources and provides an execution environment for processing.
Intermediate Level Hadoop Interview questions
Below are some intermediate level questions which are quite frequently asked.
|Q1||What are active and passive NameNodes?|
|Ans||The NameNode that runs the Hadoop cluster is called the Active NameNode. The Passive NameNode is a standby that holds the same data as the Active NameNode and is used only when the Active NameNode crashes for some reason.|
|Q2||What are different schedulers in hadoop framework?|
|Ans||The different schedulers in the Hadoop framework are COSHH, the FIFO scheduler, and Fair Sharing.|
|Q3||What are the main components of Apache HBase?|
|Ans||There are 3 main components of Apache HBase. These are:|
Region Server, HMaster, and ZooKeeper
|Q4||What is rack awareness?|
|Ans||“Rack Awareness” is an algorithm that the NameNode uses to determine the pattern in which data blocks and their replicas are stored within the Hadoop cluster. This is achieved with the help of rack definitions, which reduce the congestion between data nodes contained in the same rack.|
|Q5||What is Apache Spark?|
|Ans||Apache Spark is a framework used for real-time data analytics in a distributed computing environment. It performs in-memory computation.|
|Q6||What is HBase?|
|Ans||HBase is an open-source, distributed, multidimensional, scalable NoSQL database.|
HBase is written in Java and provides capabilities like fault tolerance and high throughput.
|Q7||What are the components of HBase?|
|Ans||Region Server, HMaster, and ZooKeeper|
|Q8||What is a UDF?|
|Ans||UDF stands for User Defined Function. UDFs let you write your own functions (for example in Hive or Pig) when the built-in functions do not cover your needs.|
|Q9||What is a RecordReader?|
|Ans||RecordReader is a class that takes data from the input source and converts it into key-value pairs for the mapper.|
|Q10||What is fault tolerance in HDFS?|
|Ans||When we say HDFS is fault tolerant, we mean that if one of the DataNodes goes down, the NameNode automatically copies its data to other nodes using the replicas. This is called fault tolerance.|
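The re-replication idea can be sketched like this. It is a toy model in plain Python, not NameNode code: when a node dies, each under-replicated block is copied to spare live nodes until the replication factor is restored.

```python
def re_replicate(block_locations, live_nodes, replication=3):
    # block_locations: block id -> set of nodes currently holding a replica
    # When a DataNode dies, the NameNode copies under-replicated blocks
    # to other live nodes until each block has `replication` replicas.
    for block, nodes in block_locations.items():
        nodes &= live_nodes                    # drop replicas on dead nodes
        spare = [n for n in live_nodes if n not in nodes]
        while len(nodes) < replication and spare:
            nodes.add(spare.pop())             # copy to a spare live node
        block_locations[block] = nodes
    return block_locations

locations = {"blk_1": {"n1", "n2", "n3"}}
live = {"n1", "n2", "n4", "n5"}                # n3 has failed
result = re_replicate(locations, live)
```

After the call, blk_1 again has three replicas, all on live nodes.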
|Q11||What is the difference between RDBMS and Hadoop|
|Ans||An RDBMS stores structured data in tables, i.e. rows and columns, while Hadoop can store any type of data: structured, semi-structured, or unstructured. RDBMS is schema-on-write while Hadoop is schema-on-read. Another key difference is that reading data is fast in an RDBMS because the data is structured, while Hadoop is fast on writes because the data needs no structure validation.|
|Q12||What is the difference between HDFS and YARN|
|Ans||HDFS, the Hadoop Distributed File System, is the storage unit of Hadoop, while YARN (Yet Another Resource Negotiator) is a processing framework that provides an execution environment for processes. It basically manages the resources in Hadoop.|
Advanced level Hadoop Interview Questions
Below are some advanced level interview questions.
|Q1||How will you debug Hadoop code?|
|Ans||Common approaches include using counters to track job progress, examining the task logs through the web UI, and running the job in local (standalone) mode so it can be stepped through like ordinary code.|
|Q2||What is Checkpointing ?|
|Ans||Checkpointing plays a vital role in HDFS. It is the process of merging the fsimage with the latest edit log and creating a new fsimage, so that the NameNode possesses the latest metadata of the HDFS namespace.|
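Conceptually, the fsimage is a snapshot of the namespace and the edit log is a journal of changes since that snapshot. A minimal sketch of the merge, modeling the namespace as a plain Python dict (this is an illustration, not the real NameNode format):

```python
def checkpoint(fsimage, edit_log):
    # fsimage: snapshot of the namespace (path -> metadata)
    # edit_log: ordered operations recorded since the last checkpoint
    new_image = dict(fsimage)
    for op, path, meta in edit_log:
        if op == "create":
            new_image[path] = meta
        elif op == "delete":
            new_image.pop(path, None)
    # The merged result becomes the new fsimage; the edit log restarts empty
    return new_image, []

fsimage = {"/data/a.txt": {"size": 10}}
edit_log = [("create", "/data/b.txt", {"size": 20}),
            ("delete", "/data/a.txt", None)]
new_image, new_log = checkpoint(fsimage, edit_log)
print(new_image)  # {'/data/b.txt': {'size': 20}}
```

Without periodic checkpoints the edit log grows unboundedly and NameNode restarts become slow, which is why the merge matters.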
|Q3||What is ZooKeeper?|
|Ans||ZooKeeper helps maintain a server state inside the cluster through communication in sessions. It is part of HBase distributed environment.|
|Q4||What is Speculative Execution in Hadoop?|
|Ans||In a Hadoop cluster some nodes may run slower than others, and this can slow down the entire application. To handle this, the Hadoop framework identifies the slow-running tasks and launches an equivalent backup attempt for each of them. Both attempts then run, and whichever one completes first is accepted while the other is discarded.|
This is called speculative execution.
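The effect on job completion time can be shown with a toy simulation in plain Python (a conceptual sketch, not the scheduler's actual logic): a straggler task gets a backup attempt, and the first finisher wins.

```python
def run_with_speculation(task_times, backup_time, threshold):
    # task_times: estimated finish time of each running task attempt
    # If a task is slower than `threshold`, a backup attempt is launched
    # and whichever attempt finishes first wins; the other is discarded.
    results = {}
    for task_id, t in task_times.items():
        if t > threshold:
            results[task_id] = min(t, backup_time)  # first finisher wins
        else:
            results[task_id] = t
    return results

times = {"t1": 5, "t2": 40}            # t2 is a straggler
finish = run_with_speculation(times, backup_time=8, threshold=10)
print(finish)  # {'t1': 5, 't2': 8}
```

Since a MapReduce job finishes only when its slowest task does, cutting the straggler from 40 to 8 shortens the whole job.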
|Q5||What are the “Region Server components”?|
|Ans||WAL (Write Ahead Log), Block Cache, MemStore, and HFile|
|Q6||What are the various mapreduce configuration parameters?|
|Ans||The input and output locations of the job in the distributed file system, the input and output data formats, the class containing the map function, and the class containing the reduce function.|
|Q7||What is the “Distributed Cache”?|
|Ans||It is a provision given by the MapReduce framework for caching files required by applications.|
|Q8||Describe the best use case difference between RDBMS and Hadoop|
|Ans||An RDBMS is best used for OLTP systems, while Hadoop is best used for data analytics and data discovery, i.e. OLAP-style systems.|
|Q9||What are the 2 components of YARN|
|Ans||The 2 components of YARN are the Resource Manager and Node Manager. |
The Resource Manager receives the processing requests, passes parts of the requests to the corresponding NodeManagers as needed, and allocates resources to applications based on their needs.
The NodeManager is installed on every DataNode and is responsible for executing tasks on that DataNode.
|Q10||What is a Block in HDFS?|
|Ans||Blocks are the smallest continuous locations on the hard drive where data is stored. HDFS stores data as blocks and distributes them across the Hadoop cluster. Files in HDFS are broken down into block-sized chunks, which are stored as independent units.|
Hadoop 1 default block size: 64 MB
Hadoop 2 default block size: 128 MB
The block size is configurable.
The dfs.blocksize parameter (dfs.block.size in older releases) can be set in the hdfs-site.xml file to change the block size in a Hadoop environment.
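For example, a block size of 128 MB would be configured in hdfs-site.xml like this (the value is given in bytes):

```xml
<!-- hdfs-site.xml: set the HDFS block size to 128 MB (value in bytes). -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>
```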
Check out our category on Top IT Skills.
Friends, Big Data and Hadoop is one of the key skill sets today, and one that is in high demand in the industry. A career in Big Data can be highly rewarding. I hope these Hadoop interview questions will be helpful for you in preparing for your Hadoop interview. If you have more questions to add to this list, do post them in the comment box below or write back to me at firstname.lastname@example.org. Good luck!