Hadoop Interview Questions
Big Data and Hadoop are among the most in-demand skills today. More and more organizations are adopting Hadoop as one of their primary data storage platforms. Roles such as Hadoop Developer, Hadoop Administrator, Data Engineer, and Data Specialist pay quite well by industry standards. So are you looking for a career in Hadoop? Are you preparing for a Hadoop interview? Below I list the top Hadoop interview questions that you must prepare if you are going for a Hadoop interview. Good luck, and let us look at these questions.
Basic Hadoop Interview Questions
Below are some basic questions that are commonly asked in Hadoop interviews for entry-level positions. Make sure that you have practiced these well.
Q1 | What is Hadoop? |
Ans: | Hadoop is an open-source software framework that provides the infrastructure and tools for processing large volumes of data. It supports distributed storage and distributed processing of data across a cluster, and it uses the MapReduce programming model. |
Q2 | What are the various components of Hadoop? |
Ans: | The core components of Hadoop are HDFS, Hadoop MapReduce, and YARN. The wider ecosystem adds: Pig and Hive for data access; HBase for data storage; Ambari, Oozie, and ZooKeeper for data management and monitoring; Thrift and Avro for data serialization; Apache Flume, Sqoop, and Chukwa for data integration; Apache Mahout and Drill for data intelligence. |
Q3 | Is Hadoop Free? |
Ans: | Yes. Hadoop is an open-source framework hosted by the Apache Software Foundation and is free to use under the Apache License 2.0. However, there are vendor-specific distributions as well, some of which are commercial. |
Q4: | Name some vendor-specific distributions of Hadoop. |
Ans: | Cloudera, MapR, Microsoft Azure HDInsight, and IBM InfoSphere BigInsights are some examples of vendor-specific distributions of Hadoop. |
Q5: | What are the various Hadoop Configuration files? |
Ans | hadoop-env.sh, mapred-site.xml, core-site.xml, yarn-site.xml, hdfs-site.xml, and the masters and slaves files (the slaves file is named workers in Hadoop 3). |
Q6 | What is Hadoop Map Reduce? |
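Ans | MapReduce is a framework used to process large volumes of data in parallel across a Hadoop cluster. It is a two-step process, Map and Reduce, which is where the name comes from. The Map step turns input records into intermediate key-value pairs, and the Reduce step aggregates the values for each key. |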
Q7 | What are the three modes in which Hadoop can run? |
Ans | Standalone mode: the default mode. It uses the local file system and a single Java process to run the Hadoop services. Pseudo-distributed mode: a single-node Hadoop deployment runs all Hadoop services. Fully-distributed mode: separate nodes run the Hadoop master and slave services. |
Q8 | What are the 5 V’s of Big Data? |
Ans | The 5 V’s are Volume, Velocity, Variety, Veracity, and Value. |
Q9 | What are the most common input formats in Hadoop? |
Ans | The 3 most common input formats are TextInputFormat (the default), SequenceFileInputFormat, and KeyValueTextInputFormat. |
Q10 | What is YARN? |
Ans | YARN stands for Yet Another Resource Negotiator. It is Hadoop’s resource management layer: it allocates cluster resources such as CPU and memory to applications and schedules their tasks, which gives processing frameworks an environment to run in. |
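To make the two-step MapReduce model from Q6 more concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not Hadoop's actual API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and everything runs in a single process instead of across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, which the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

In a real cluster, many mappers and reducers would run these steps in parallel on different nodes; the sketch only shows how the data flows from lines to key-value pairs to per-key totals.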
Intermediate Level Hadoop Interview Questions
Below are some intermediate-level questions that are asked quite frequently.
Q1 | What are active and passive NameNodes? |
Ans | The NameNode that serves the Hadoop cluster is called the active NameNode. The passive NameNode is a standby that keeps a copy of the active NameNode's data and takes over only when the active NameNode crashes for some reason. |
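Q2 | What are the different schedulers in the Hadoop framework? |
Ans | The schedulers usually named for Hadoop are FIFO, the Capacity Scheduler, and the Fair Scheduler (fair sharing). COSHH, a scheduler proposed for heterogeneous Hadoop clusters, is also sometimes mentioned. |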
Q3 | What are the main components of Apache HBase? |
Ans | There are 3 main components of Apache HBase: the Region Server, the HMaster, and ZooKeeper. |
Q4 | What is rack awareness? |
Ans | “Rack Awareness” is the policy that the NameNode uses to decide where data blocks and their replicas are placed within the Hadoop cluster. Using rack definitions, it spreads replicas across racks for fault tolerance while keeping traffic between DataNodes in the same rack where possible, which reduces network congestion between racks. |
Q5 | What is Apache Spark? |
Ans | Apache Spark is a framework for fast data analytics, including near-real-time processing, in a distributed computing environment. It performs its computation in memory. |
Q6 | What is HBase? |
Ans | HBase is an open-source, distributed, multidimensional, scalable NoSQL database. It is written in Java and provides capabilities such as fault tolerance and high throughput. |
Q7 | What are the components of HBase? |
Ans | Region Server, HMaster, and ZooKeeper. |
Q8 | What is a UDF? |
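Ans | UDF stands for user-defined function: custom code that you write to extend tools such as Hive or Pig when the built-in functions do not cover the processing you need. |
Q9 | What is a RecordReader? |
Ans | RecordReader is a class that reads data from the source (an input split) and converts it into key-value pairs for the mapper. |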
Q10 | What is fault tolerance in HDFS? |
Ans | When we say HDFS is fault tolerant, we mean that if one of the DataNodes goes down, the NameNode automatically has the lost data copied to other nodes from the surviving replicas. This is called fault tolerance. |
Q11 | What is the difference between RDBMS and Hadoop? |
Ans | An RDBMS stores structured data in tables, i.e. rows and columns, while Hadoop can store any type of data: structured, semi-structured, or unstructured. An RDBMS is schema on write, while Hadoop is schema on read. Another key difference is that reads are fast in an RDBMS because the data is already structured, while writes are fast in Hadoop because the data needs no structure validation when it is stored. |
Q12 | What is the difference between HDFS and YARN? |
Ans | HDFS (Hadoop Distributed File System) is the storage unit of Hadoop, while YARN (Yet Another Resource Negotiator) manages the resources in Hadoop and provides the execution environment for processing jobs. |
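As a rough illustration of the RecordReader answer (Q9), the sketch below turns raw text into key-value pairs in the style of Hadoop's default text input, where the key is the byte offset of a line and the value is the line itself. It is plain Python rather than the Hadoop Java API, and the function name read_records is an assumption made up for this example.

```python
def read_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one pair per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line without its newline.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


if __name__ == "__main__":
    for key, value in read_records(b"hello\nbig data\n"):
        print(key, value)
```

Each pair produced this way is what a mapper would receive as its input key and value.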
Advanced Level Hadoop Interview Questions
Below are some advanced-level interview questions.
Q1 | How will you debug Hadoop code? |
Ans | |
Q2 | What is checkpointing? |
Ans | Checkpointing plays a vital role in HDFS. It is the process of merging the fsimage with the latest edit log to create a new fsimage, so that the NameNode holds the latest metadata of the HDFS namespace. |
Q3 | What is ZooKeeper? |
Ans | ZooKeeper is a coordination service that helps maintain server state inside the cluster by communicating with the servers through sessions. It is part of the HBase distributed environment. |
Q4 | What is Speculative Execution in Hadoop? |
Ans | In a Hadoop cluster, some nodes may run slower than others, and this can slow down the entire application. To handle this, the Hadoop framework detects slow-running tasks and launches an equivalent backup task on another node. Both copies then run, and whichever finishes first is accepted while the other is discarded. This is called speculative execution. |
Q5 | What are the components of a Region Server? |
Ans | WAL (Write Ahead Log), Block Cache, MemStore, and HFile. |
Q6 | What are the various MapReduce configuration parameters? |
Ans | The input and output locations of the job in the distributed file system, the input and output data formats, the class containing the map function, and the class containing the reduce function. |
Q7 | What is a “Distributed Cache”? |
Ans | It is a facility provided by the MapReduce framework for caching files that applications need, so that they are available on the nodes where the tasks run. |
Q8 | Describe the difference between the best use cases for RDBMS and Hadoop. |
Ans | An RDBMS is best suited to OLTP systems, while Hadoop is best suited to data analytics and data discovery, i.e. OLAP systems. |
Q9 | What are the 2 components of YARN? |
Ans | The 2 components of YARN are the ResourceManager and the NodeManager. The ResourceManager receives processing requests, passes parts of those requests to the corresponding NodeManagers as needed, and allocates resources to applications based on their needs. A NodeManager is installed on every DataNode and is responsible for executing tasks on that node. |
Q10 | What is a Block in HDFS? |
Ans | A block is the smallest contiguous location on the hard drive where data is stored. HDFS stores data as blocks and distributes them across the Hadoop cluster. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. The default block size is 64 MB in Hadoop 1 and 128 MB in Hadoop 2. The block size is configurable: set the dfs.block.size parameter (named dfs.blocksize in Hadoop 2 and later, where the old name is deprecated) in the hdfs-site.xml file. |
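The block arithmetic from the last answer can be worked through with a short sketch. This is plain Python for illustration only, not an HDFS API, and the function name split_into_blocks is hypothetical. It assumes sizes are given in whole megabytes and relies on the fact that the last block of a file only takes up as much space as the remaining data.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file of the given size is split into."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The final block holds only the leftover data, not a full block.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    print(split_into_blocks(300))      # Hadoop 2 default block size
    print(split_into_blocks(300, 64))  # Hadoop 1 default block size
```

For a 300 MB file, the 128 MB default gives three blocks (128, 128, and 44 MB), while the older 64 MB default gives five blocks (four of 64 MB and one of 44 MB).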
Conclusion
Friends, Big Data and Hadoop are among today's key skill sets and are in high demand in the industry. A career in Big Data can be highly rewarding. I hope these Hadoop interview questions will help you prepare for your Hadoop interview. If you have more questions to add to this list, post them in the comment box below or write to me at skumar@indiacareeradvice.com. Good luck!