Hadoop Interview Questions
Big Data and Hadoop are among the hottest skills to have today. More and more organizations are adopting Hadoop as one of their primary data platforms. Today there are roles like Hadoop Developer, Hadoop Administrator, Data Engineer, and Data Specialist, which pay quite well by industry standards. So are you looking for a career in Hadoop? Are you preparing for a Hadoop interview? Today I list the top Hadoop interview questions that are a must to prepare if you are going for a Hadoop interview. So good luck, and let us look at these Hadoop interview questions.
Basic Hadoop Interview questions
Below are some basic interview questions that are generally asked in any Hadoop interview if you are going for an entry-level position. Make sure that you have practiced these well.
|Q1||What is Hadoop?|
|Ans:||Hadoop is an open-source software framework that provides infrastructure and software tools / applications for processing large volumes of data. It provides distributed storage and distributed processing of data, and uses the MapReduce programming model.|
|Q2||What are the various components of Hadoop?|
|Ans:||The various components of Hadoop are:|
Pig and Hive – for data access
HBase – for data storage
Ambari, Oozie and ZooKeeper – for data management and monitoring
Thrift and Avro – for data serialization
Apache Flume, Sqoop, Chukwa – for data integration
Apache Mahout and Drill – for data intelligence
|Q3||Is Hadoop Free?|
|Ans:||Hadoop is an open-source framework hosted within the Apache Software Foundation and is free to use. However, there are vendor-specific distributions as well.|
|Q4:||Name some vendor specific distribution of Hadoop|
|Ans:||Cloudera, MapR, Microsoft Azure HDInsight, and IBM InfoSphere BigInsights are some examples of vendor-specific distributions of Hadoop.|
|Q5:||What are the various Hadoop Configuration files?|
|Ans:||The main configuration files are core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml, along with the masters and slaves files.|
|Q6||What is Hadoop Map Reduce?|
|Ans||MapReduce is a framework used to process large volumes of data in parallel across a Hadoop cluster. It is a two-step process – Map and Reduce – and hence it is called MapReduce.|
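The two-step process can be illustrated with a small word-count sketch. This is plain Python simulating the map, shuffle, and reduce phases conceptually; it is not the actual Hadoop API, where you would instead subclass Mapper and Reducer in Java.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 2
```

In a real cluster the map tasks run in parallel on different nodes, and the framework performs the shuffle over the network before the reduce tasks start.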
|Q7||What are the three modes in which Hadoop can run?|
|Ans||Standalone mode: the default mode. It uses the local file system and a single Java process to run the Hadoop services.|
Pseudo-distributed mode: This uses a single-node Hadoop deployment to execute all Hadoop services.
Fully-distributed mode: This uses separate nodes to run Hadoop master and slave services.
|Q9||What are the 5 V’s of the Hadoop framework?|
|Ans||The 5 V’s are : Volume, Velocity, Variety, Veracity and Value|
|Q10||What are the most common input formats in Hadoop?|
|Ans||The 3 most common input formats are:|
Text input format (TextInputFormat, the default)
Sequence file input format (SequenceFileInputFormat)
Key-value input format (KeyValueTextInputFormat)
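The difference between the text and key-value formats is in what they hand to the mapper. The sketch below mimics this behavior in plain Python (it is a conceptual illustration, not the Hadoop classes themselves): the text format keys each line by its byte offset, while the key-value format splits each line at the first tab.

```python
def text_input_format(data):
    # TextInputFormat-style: key = byte offset of the line, value = line text
    pairs, offset = [], 0
    for line in data.splitlines(keepends=True):
        pairs.append((offset, line.rstrip("\n")))
        offset += len(line)
    return pairs

def key_value_input_format(data, sep="\t"):
    # KeyValueTextInputFormat-style: split each line at the first separator
    pairs = []
    for line in data.splitlines():
        key, _, value = line.partition(sep)
        pairs.append((key, value))
    return pairs

data = "a\tapple\nb\tbanana\n"
print(text_input_format(data))       # keys are byte offsets: 0, then 8
print(key_value_input_format(data))  # keys are 'a' and 'b'
```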
|Q11||What is YARN?|
|Ans||YARN stands for Yet Another Resource Negotiator.|
It is Hadoop’s resource management and job scheduling framework: it manages cluster resources and provides an execution environment for processing.
Intermediate Level Hadoop Interview questions
Below are some intermediate level questions which are quite frequently asked.
|Q1||What are active and passive NameNodes?|
|Ans||The NameNode that runs the Hadoop cluster is called the Active NameNode. The Passive NameNode is a standby that holds the same data as the Active NameNode and is used only when the Active NameNode crashes for some reason.|
|Q2||What are different schedulers in hadoop framework?|
|Ans||The different schedulers in the Hadoop framework are COSHH, the FIFO scheduler, and Fair Sharing.|
|Q3||What are the main components of Apache HBase?|
|Ans||There are 3 main components of Apache HBase. These are:|
Region Server, HMaster, and ZooKeeper
|Q4||What is rack awareness?|
|Ans||“Rack Awareness” is an algorithm that the NameNode uses to determine the pattern in which data blocks and their replicas are stored within the Hadoop cluster. This is achieved with the help of rack definitions, which reduce the congestion between data nodes contained in the same rack.|
|Q5||What is Apache Spark?|
|Ans||Apache Spark is a framework used for real-time data analytics in a distributed computing environment. It performs in-memory computation.|
|Q6||What is HBase?|
|Ans||HBase is an open-source, distributed, multidimensional, scalable NoSQL database.|
HBase is written in Java and provides capabilities like fault tolerance and high throughput.
|Q7||What are the components of HBase?|
|Ans||Region Server, HMaster, and ZooKeeper|
|Q8||What is a UDF?|
|Ans||UDF stands for User Defined Function. UDFs let you write your own functions (for example in Hive or Pig) when the built-in functions do not cover your needs.|
|Q9||What is a RecordReader?|
|Ans||RecordReader is a class that takes data from the input source and converts it into key-value pairs for the mapper.|
|Q10||What is fault tolerance in HDFS?|
|Ans||When we say HDFS is fault tolerant, we mean that if one of the DataNodes goes down, the NameNode automatically copies its data to other nodes using the replicas. This is called fault tolerance.|
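The re-replication idea can be sketched like this. It is a toy model in plain Python, not NameNode code: when a node dies, each under-replicated block is copied to spare live nodes until the replication factor is restored.

```python
def re_replicate(block_locations, live_nodes, replication=3):
    # block_locations: block id -> set of nodes currently holding a replica
    # When a DataNode dies, the NameNode copies under-replicated blocks
    # to other live nodes until each block has `replication` replicas.
    for block, nodes in block_locations.items():
        nodes &= live_nodes                    # drop replicas on dead nodes
        spare = [n for n in live_nodes if n not in nodes]
        while len(nodes) < replication and spare:
            nodes.add(spare.pop())             # copy to a spare live node
        block_locations[block] = nodes
    return block_locations

locations = {"blk_1": {"n1", "n2", "n3"}}
live = {"n1", "n2", "n4", "n5"}                # n3 has failed
result = re_replicate(locations, live)
```

After the call, blk_1 again has three replicas, all on live nodes.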
|Q11||What is the difference between RDBMS and Hadoop|
|Ans||An RDBMS stores structured data in tables, i.e. rows and columns, while Hadoop can store any type of data: structured, semi-structured, or unstructured. RDBMS is schema-on-write while Hadoop is schema-on-read. Another key difference is that reading data is fast in an RDBMS because the data is structured, while Hadoop is fast on writes because the data needs no structure validation.|
|Q12||What is the difference between HDFS and YARN|
|Ans||HDFS, the Hadoop Distributed File System, is the storage unit of Hadoop, while YARN (Yet Another Resource Negotiator) is a processing framework that provides an execution environment for processes. It basically manages the resources in Hadoop.|
Advanced level Hadoop Interview Questions
Below are some advanced level interview questions.
|Q1||How will you debug Hadoop code?|
|Ans||Common approaches include using counters to track job progress, examining the task logs through the web UI, and running the job in local (standalone) mode so it can be stepped through like ordinary code.|
|Q2||What is Checkpointing ?|
|Ans||Checkpointing plays a vital role in HDFS. It is the process of merging the fsimage with the latest edit log and creating a new fsimage, so that the NameNode possesses the latest metadata of the HDFS namespace.|
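Conceptually, the fsimage is a snapshot of the namespace and the edit log is a journal of changes since that snapshot. A minimal sketch of the merge, modeling the namespace as a plain Python dict (this is an illustration, not the real NameNode format):

```python
def checkpoint(fsimage, edit_log):
    # fsimage: snapshot of the namespace (path -> metadata)
    # edit_log: ordered operations recorded since the last checkpoint
    new_image = dict(fsimage)
    for op, path, meta in edit_log:
        if op == "create":
            new_image[path] = meta
        elif op == "delete":
            new_image.pop(path, None)
    # The merged result becomes the new fsimage; the edit log restarts empty
    return new_image, []

fsimage = {"/data/a.txt": {"size": 10}}
edit_log = [("create", "/data/b.txt", {"size": 20}),
            ("delete", "/data/a.txt", None)]
new_image, new_log = checkpoint(fsimage, edit_log)
print(new_image)  # {'/data/b.txt': {'size': 20}}
```

Without periodic checkpoints the edit log grows unboundedly and NameNode restarts become slow, which is why the merge matters.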
|Q3||What is ZooKeeper?|
|Ans||ZooKeeper helps maintain a server state inside the cluster through communication in sessions. It is part of HBase distributed environment.|
|Q4||What is Speculative Execution in Hadoop?|
|Ans||In a Hadoop cluster some nodes may run slower than others, and this can slow down the entire application. To handle this, the Hadoop framework identifies the slow-running tasks and launches an equivalent backup attempt for each of them. Both attempts then run, and whichever one completes first is accepted while the other is discarded.|
This is called speculative execution.
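The effect on job completion time can be shown with a toy simulation in plain Python (a conceptual sketch, not the scheduler's actual logic): a straggler task gets a backup attempt, and the first finisher wins.

```python
def run_with_speculation(task_times, backup_time, threshold):
    # task_times: estimated finish time of each running task attempt
    # If a task is slower than `threshold`, a backup attempt is launched
    # and whichever attempt finishes first wins; the other is discarded.
    results = {}
    for task_id, t in task_times.items():
        if t > threshold:
            results[task_id] = min(t, backup_time)  # first finisher wins
        else:
            results[task_id] = t
    return results

times = {"t1": 5, "t2": 40}            # t2 is a straggler
finish = run_with_speculation(times, backup_time=8, threshold=10)
print(finish)  # {'t1': 5, 't2': 8}
```

Since a MapReduce job finishes only when its slowest task does, cutting the straggler from 40 to 8 shortens the whole job.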
|Q5||What are the “Region Server components”?|
|Ans||WAL (Write Ahead Log), Block Cache, MemStore, and HFile|
|Q6||What are the various mapreduce configuration parameters?|
|Ans||The input and output locations of the job in the distributed file system, the input and output data formats, the class containing the map function, and the class containing the reduce function.|
|Q7||What is the “Distributed Cache”?|
|Ans||It is a provision given by the MapReduce framework for caching files required by applications.|
|Q8||Describe the best use case difference between RDBMS and Hadoop|
|Ans||An RDBMS is best used for OLTP systems, while Hadoop is best used for data analytics and data discovery, i.e. OLAP-style systems.|
|Q9||What are the 2 components of YARN|
|Ans||The 2 components of YARN are the Resource Manager and Node Manager. |
The Resource Manager receives the processing requests, passes parts of the requests to the corresponding NodeManagers as needed, and allocates resources to applications based on their needs.
The NodeManager is installed on every DataNode and is responsible for executing tasks on that DataNode.
|Q10||What is a Block in HDFS?|
|Ans||Blocks are the smallest continuous locations on the hard drive where data is stored. HDFS stores data as blocks and distributes them across the Hadoop cluster. Files in HDFS are broken down into block-sized chunks, which are stored as independent units.|
Hadoop 1 default block size: 64 MB
Hadoop 2 default block size: 128 MB
The block size is configurable.
The dfs.blocksize parameter (dfs.block.size in older releases) can be set in the hdfs-site.xml file to change the block size in a Hadoop environment.
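For example, a block size of 128 MB would be configured in hdfs-site.xml like this (the value is given in bytes):

```xml
<!-- hdfs-site.xml: set the HDFS block size to 128 MB (value in bytes). -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>
```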
Check out our category on Top IT Skills.
Friends, Big Data and Hadoop is one of the key skill sets today, and one that is in high demand in the industry. A career in Big Data can be highly rewarding. I hope these Hadoop interview questions will be helpful for you in preparing for your Hadoop interview. If you have more questions to add to this list, do post them in the comment box below or write back to me at firstname.lastname@example.org. Good luck!