Technical Round Materials-Bigdata-Free Download

1. Compare Hadoop & Spark

Criteria            | Hadoop                   | Spark
Dedicated storage   | HDFS                     | None
Speed of processing | Average                  | Excellent
Libraries           | Separate tools available | Spark Core, SQL, Streaming, MLlib, GraphX

2. What are real-time industry applications of Hadoop?

Hadoop, well known as Apache Hadoop, is an open-source software platform for scalable and distributed computing of large volumes of data. It provides rapid, high-performance and cost-effective analysis of structured and unstructured data generated on digital platforms and within the enterprise. It is used in almost all departments and sectors today. Some of the instances where Hadoop is used:

  • Managing traffic on streets.
  • Streaming processing.
  • Content Management and Archiving Emails.
  • Processing Rat Brain Neuronal Signals using a Hadoop Computing Cluster.
  • Fraud detection and Prevention.
  • Advertisements Targeting Platforms are using Hadoop to capture and analyze click stream, transaction, video and social media data.
  • Managing content, posts, images and videos on social media platforms.
  • Analyzing customer data in real-time for improving business performance.
  • Public sector fields such as intelligence, defense, cyber security and scientific research.
  • Financial agencies are using Big Data Hadoop to reduce risk, analyze fraud patterns, identify rogue traders, more precisely target their marketing campaigns based on customer segmentation, and improve customer satisfaction.
  • Getting access to unstructured data like output from medical devices, doctor’s notes, lab results, imaging reports, medical correspondence, clinical data, and financial data.

3. How is Hadoop different from other parallel computing systems?

Hadoop is a distributed file system that lets you store and handle massive amounts of data on a cluster of machines while handling data redundancy. The primary benefit is that, since data is stored in several nodes, it is better to process it in a distributed manner: each node can process the data stored on it instead of spending time moving it over the network.

On the contrary, in a relational database computing system, you can query data in real time, but it is not efficient to store data in tables, records and columns when the data is huge.

4. What all modes Hadoop can be run in?

Hadoop can run in three modes:

  • Standalone Mode: The default mode of Hadoop, it uses the local file system for input and output operations. This mode is mainly used for debugging purposes, and it does not support the use of HDFS. Further, in this mode no custom configuration is required for the mapred-site.xml, core-site.xml and hdfs-site.xml files. It is much faster than the other modes.
  • Pseudo-Distributed Mode (Single-Node Cluster): In this case, you need configuration for all three files mentioned above. All daemons run on one node, and thus the Master and Slave nodes are the same.
  • Fully Distributed Mode (Multi-Node Cluster): This is the production phase of Hadoop (what Hadoop is known for), where data is used and distributed across several nodes of a Hadoop cluster. Separate nodes are allotted as Master and Slaves.

5. Explain the major difference between HDFS block and InputSplit.

In simple terms, a block is the physical representation of data, while a split is the logical representation of the data present in the block. The split acts as an intermediary between the block and the mapper.
Suppose we have two blocks:
Block 1: ii nntteell
Block 2: Ii ppaatt

Now, the map will read the first block from ii till ll, but it does not know how to process the second block at the same time. This is where the split comes into play: it forms a logical group of Block 1 and Block 2 as a single block.

It then forms key-value pairs using the input format and record reader and sends them to the map for further processing. With input splits, if you have limited resources, you can increase the split size to limit the number of maps. For instance, if there are 10 blocks totaling 640 MB (64 MB each) and resources are limited, you can set the 'split size' to 128 MB. This will form logical groups of 128 MB, with only 5 maps executing at a time.

However, if splitting is disabled, the whole file forms one input split and is processed by a single map, consuming more time when the file is bigger.
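As a minimal sketch (not taken from the material above), the following shows how a MapReduce job could raise the minimum split size so that fewer map tasks are launched; the class name, job name and the 128 MB value are illustrative.

// Hypothetical sketch: raising the minimum split size so fewer map tasks run.
// Uses the standard Hadoop 2.x MapReduce API; class and job names are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");
        // Ask for splits of at least 128 MB, so ten 64 MB blocks yield about five maps
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        // ... set mapper, reducer, input and output paths as usual ...
    }
}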

6. What is distributed cache and what are its benefits?

Distributed Cache, in Hadoop, is a service provided by the MapReduce framework to cache files when needed. Once a file is cached for a specific job, Hadoop makes it available on each data node, both on disk and in memory, where map and reduce tasks are executing. Later, you can easily access and read the cache file and populate any collection (like an array or hashmap) in your code.

Benefits of using distributed cache are:

  • It distributes simple, read-only text/data files and/or complex types like jars, archives and others. These archives are then un-archived at the slave nodes.
  • Distributed Cache tracks the modification timestamps of cache files, which ensures that the files are not modified while a job is executing.
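As a minimal sketch (assuming the newer Job API and a hypothetical lookup file in HDFS), a file could be placed in the distributed cache like this:

// Hypothetical sketch: adding a lookup file to the distributed cache via the Job API.
// The HDFS path and symlink name are illustrative, not from the original material.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetupExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache-demo");
        // Each task can later open "lookup.txt" as a local file, e.g. inside Mapper.setup(),
        // and load its contents into a HashMap.
        job.addCacheFile(new URI("/user/demo/lookup.txt#lookup.txt"));
        // ... set mapper, reducer, input and output paths as usual ...
    }
}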

7. Explain the difference between NameNode, Checkpoint NameNode and BackupNode.

  • NameNode is the core of HDFS that manages the metadata – the information of what file maps to what block locations and what blocks are stored on what datanode. In simple terms, it’s the data about the data being stored. NameNode supports a directory tree-like structure consisting of all the files present in HDFS on a Hadoop cluster. It uses following files for namespace:
    fsimage file- It keeps track of the latest checkpoint of the namespace.
    edits file-It is a log of changes that have been made to the namespace since checkpoint.
  • Checkpoint NameNode has the same directory structure as the NameNode and creates checkpoints for the namespace at regular intervals by downloading the fsimage and edits files and merging them in a local directory. The new image after merging is then uploaded to the NameNode.
    There is a similar node like Checkpoint, commonly known as Secondary Node, but it does not support the ‘upload to NameNode’ functionality.
  • Backup Node provides functionality similar to Checkpoint, staying in sync with the NameNode. It maintains an up-to-date in-memory copy of the file system namespace and does not need to fetch changes at regular intervals. The Backup Node only needs to save its current in-memory state to an image file to create a new checkpoint.

8. What are the most common Input Formats in Hadoop?

There are three most common input formats in Hadoop:

  • Text Input Format: Default input format in Hadoop.
  • Key Value Input Format: used for plain text files where the files are broken into lines
  • Sequence File Input Format: used for reading files in sequence

9. Define DataNode and how does NameNode tackle DataNode failures?

DataNodes store data in HDFS; they are the nodes where the actual data resides in the file system. Each DataNode sends a heartbeat message to notify the NameNode that it is alive. If the NameNode does not receive a message from a DataNode for 10 minutes, it considers the DataNode to be dead or out of service and starts replicating the blocks that were hosted on that DataNode onto other DataNodes. A BlockReport contains the list of all blocks on a DataNode. The system then replicates the blocks that were stored on the dead DataNode.

The NameNode manages the replication of data blocks from one DataNode to another. In this process, the replicated data transfers directly between DataNodes such that the data never passes through the NameNode.

10. What are the core methods of a Reducer?

The three core methods of a Reducer are:

1. setup(): this method is used for configuring various parameters like input data size and the distributed cache.
protected void setup(Context context)

2. reduce(): the heart of the reducer, called once per key with the associated list of values for that reduce task
protected void reduce(KEY key, Iterable<VALUE> values, Context context)

3. cleanup(): this method is called to clean up temporary files, only once at the end of the task
protected void cleanup(Context context)
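As a minimal sketch of these three methods in context (the class name and the word-count style summing are illustrative, not from the material above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) {
        // read configuration values, open files from the distributed cache, etc.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();   // aggregate all values seen for this key
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // release resources and remove temporary files
    }
}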

11. What is SequenceFile in Hadoop?

Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key/value pairs. The map outputs are stored as SequenceFile internally. It provides Reader, Writer and Sorter classes. The three SequenceFile formats are:

1. Uncompressed key/value records.

2. Record compressed key/value records – only ‘values’ are compressed here.

3. Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable.
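A minimal sketch of writing a SequenceFile with the Writer class mentioned above (the output path and the Text/IntWritable key-value choice are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/example.seq");   // hypothetical output location
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            // append binary key/value pairs; compression options could also be passed above
            writer.append(new Text("key-1"), new IntWritable(1));
            writer.append(new Text("key-2"), new IntWritable(2));
        }
    }
}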

12. What is Job Tracker role in Hadoop?

The Job Tracker's primary functions are resource management (managing the TaskTrackers), tracking resource availability, and task life-cycle management (tracking task progress and fault tolerance).

  • It is a process that runs on a separate node, typically not on a DataNode.
  • Job Tracker communicates with the NameNode to identify data location.
  • Finds the best Task Tracker Nodes to execute tasks on given nodes.
  • Monitors individual Task Trackers and submits the overall job back to the client.
  • It tracks the execution of MapReduce workloads local to the slave node.

13. What is the use of RecordReader in Hadoop?

Since Hadoop splits data into various blocks, RecordReader is used to read the split data into a single record. For instance, if our input data is split like:
Row1: Welcome to

Row2: Intellipaat
It will be read as “Welcome to Intellipaat” using RecordReader.

14. What is Speculative Execution in Hadoop?

One limitation of Hadoop is that, by distributing the tasks over several nodes, a few slow nodes can limit the rest of the program. There are various reasons for tasks to be slow, which are sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks, Hadoop tries to detect when a task runs slower than expected and then launches another equivalent task as a backup. This backup mechanism in Hadoop is Speculative Execution.

It creates a duplicate task on another node. The same input can thus be processed multiple times in parallel. When most tasks in a job come to completion, the speculative execution mechanism schedules duplicate copies of the remaining (slower) tasks across the nodes that are currently free. When these tasks finish, the JobTracker is notified. If other copies were executing speculatively, Hadoop notifies the TaskTrackers to quit those tasks and reject their output.

Speculative execution is enabled by default in Hadoop. To disable it, set the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false.
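As a minimal sketch (using the classic property names quoted above; newer releases use mapreduce.map.speculative and mapreduce.reduce.speculative instead), speculative execution could be disabled like this:

// Hypothetical sketch: turning off speculative execution before submitting a job.
import org.apache.hadoop.conf.Configuration;

public class DisableSpeculation {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        // pass conf to Job.getInstance(conf, ...) when building the job
    }
}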

15. What happens if you try to run a Hadoop job with an output directory that is already present?

It will throw an exception saying that the output file directory already exists.

To run the MapReduce job, you need to ensure that the output directory does not already exist in HDFS.

To delete the directory before running the job, you can use the shell:

hadoop fs -rm -r /path/to/your/output/

Or via the Java API:

FileSystem.get(conf).delete(outputDir, true);
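As a minimal self-contained sketch of that clean-up step (the class name and output path are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanOutputDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path outputDir = new Path("/path/to/your/output");
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outputDir)) {
            fs.delete(outputDir, true);   // true = delete recursively
        }
    }
}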

16. How can you debug Hadoop code?

First, check the list of MapReduce jobs currently running. Next, check that no orphaned jobs are running; if there are, you need to determine the location of the ResourceManager (RM) logs.

1. Run: "ps -ef | grep -i ResourceManager"
and look for the log directory in the displayed result. Find the job-id from the displayed list and check if there is any error message associated with that job.

2. On the basis of RM logs, identify the worker node that was involved in execution of the task.

3. Now, log in to that node and run: "ps -ef | grep -i NodeManager"

4. Examine the Node Manager log. The majority of errors come from user level logs for each map-reduce job.

17. How to configure Replication Factor in HDFS?

hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all files placed in HDFS.

You can also modify the replication factor on a per-file basis using the Hadoop FS shell:

[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file

Conversely, you can also change the replication factor of all the files under a directory:

[training@localhost ~]$ hadoop fs -setrep -w 3 -R /my/dir

18. How to compress mapper output but not the reducer output?

To achieve this compression, you should set:

conf.set("mapreduce.map.output.compress", true)

conf.set("mapreduce.output.fileoutputformat.compress", false)

19. What is the difference between Map Side join and Reduce Side Join?

A map-side join is performed before the data reaches the reduce phase, i.e. while it is still on the map side, and it requires a strict structure for the input (for example, sorted and identically partitioned datasets). A reduce-side join (repartitioned join), on the other hand, is simpler than a map-side join since the input datasets need not be structured. However, it is less efficient, as it has to go through the sort and shuffle phases, which bring network overhead.

20. How can you transfer data from Hive to HDFS?

By writing the query:

hive> insert overwrite directory '/' select * from emp;

You can write your query for the data you want to export from Hive to HDFS. The output you receive will be stored in part files in the specified HDFS path.

Which language is more suitable for text analytics? R or Python?

Python has a rich library called Pandas, which gives analysts high-level data analysis tools and convenient data structures, something R does not match in a single package. Hence Python is more suitable for text analytics.

What is a Recommender System?

A recommender system is today widely deployed in multiple fields like movie recommendations, music preferences, social tags, research articles, search queries and so on. The recommender systems work as per collaborative and content-based filtering or by deploying a personality-based approach. This type of system works based on a person’s past behavior in order to build a model for the future. This will predict the future product buying, movie viewing or book reading by people. It also creates a filtering approach using the discrete characteristics of items while recommending additional items.

6. Compare SAS, R and Python programming?

SAS: It is one of the most widely used analytics tools, used by some of the biggest companies in the world. It has some of the best statistical functions and a graphical user interface, but it comes with a price tag, so it cannot be readily adopted by smaller enterprises.
R: The best part about R is that it is an open-source tool and hence used generously by academia and the research community. It is a robust tool for statistical computation, graphical representation and reporting. Due to its open-source nature it is continually updated with the latest features, which are readily available to everybody.
Python: Python is a powerful open-source programming language that is easy to learn and works well with most other tools and technologies. The best part about Python is that it has innumerable libraries and community-created modules, making it very robust. It has functions for statistical operations, model building and more.

R and Python are two of the most important programming languages for Machine Learning Algorithms.

7. Explain the various benefits of R language?

The R programming language includes a suite of software used for graphical representation, statistical computing, data manipulation and calculation.
Some of the highlights of R programming environment include the following:

  • An extensive collection of tools for data analysis
  • Operators for performing calculations on matrix and array
  • Data analysis technique for graphical representation
  • A highly developed yet simple and effective programming language
  • It extensively supports machine learning applications
  • It acts as a connecting link between various software, tools and datasets
  • Create high quality reproducible analysis that is flexible and powerful
  • Provides a robust package ecosystem for diverse needs
  • It is useful when you have to solve a data-oriented problem

8. What are the two main components of the Hadoop Framework?

HDFS and YARN are basically the two major components of Hadoop framework.

  • HDFS- Stands for Hadoop Distributed File System. It is the distributed file system underlying Hadoop, capable of storing and retrieving large datasets quickly.
  • YARN- Stands for Yet Another Resource Negotiator. It allocates resources dynamically and handles the workloads.

9. How do Data Scientists use Statistics?

Statistics helps Data Scientists to look into the data for patterns, hidden insights and convert Big Data into Big insights. It helps to get a better idea of what the customers are expecting. Data Scientists can learn about the consumer behavior, interest, engagement, retention and finally conversion all through the power of insightful statistics. It helps them to build powerful data models in order to validate certain inferences and predictions. All this can be converted into a powerful business proposition by giving users what they want at precisely when they want it.

What is logistic regression?

Logistic regression is a statistical technique, or model, used to analyze a dataset and predict a binary outcome, i.e. an outcome that is either zero or one, a yes or a no. (Random Forest, by comparison, is an important technique used for classification, regression and other tasks on data.)
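For reference, in the single-predictor case the logistic model estimates the probability of the positive class with the logistic (sigmoid) function:

P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}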

11. Why data cleansing is important in data analysis?

With data coming in from multiple sources, it is important to ensure that the data is good enough for analysis. This is where data cleansing becomes extremely vital. Data cleansing deals extensively with the process of detecting and correcting data records, ensuring that data is complete and accurate, and that components of the data that are irrelevant are deleted or modified as per the needs. This process can be deployed in concurrence with data wrangling or batch processing.
Once the data is cleaned, it conforms to the rules of the data sets in the system. Data cleansing is an essential part of data science because data can be prone to error due to human negligence, corruption during transmission or storage, among other things. Data cleansing takes a huge chunk of a Data Scientist's time and effort because of the multiple sources from which data emanates and the speed at which it arrives.

Describe univariate, bivariate and multivariate analysis.

As the names suggest, these are analysis methodologies involving one, two or multiple variables.
A univariate analysis involves a single variable, so no relationships or causes are examined. The major aspect of univariate analysis is to summarize the data and find patterns within it to make actionable decisions.
A bivariate analysis deals with the relationship between two sets of data. These sets of paired data come from related sources or samples. There are various tools to analyze such data, including chi-squared tests and t-tests when the data show a correlation. If the data can be quantified, it can be analyzed using a graph plot or a scatterplot. The strength of the correlation between the two data sets is tested in a bivariate analysis. A multivariate analysis extends this to three or more variables in order to study the relationships among them.

13. How machine learning is deployed in real world scenarios?

Here are some of the scenarios in which machine learning finds applications in real world:

  • Ecommerce: Understanding the customer churn, deploying targeted advertising, remarketing
  • Search engine: Ranking pages depending on the personal preferences of the searcher
  • Finance: Evaluating investment opportunities & risks, detecting fraudulent transactions
  • Healthcare: Designing drugs depending on the patient's history and needs
  • Robotics: Machine learning for handling situations that are out of the ordinary
  • Social media: Understanding relationships and recommending connections
  • Extraction of information: framing questions for getting answers from databases over the web

14. What are the various aspects of a Machine Learning process?

In this post I will discuss the components involved in solving a problem using machine learning.

  • Domain knowledge
    This is the first step wherein we need to understand how to extract the various features from the data and learn more about the data that we are dealing with. It has got more to do with the type of domain that we are dealing with and familiarizing the system to learn more about it.
  • Feature Selection
    This step has got more to do with the feature that we are selecting from the set of features that we have. Sometimes it happens that there are a lot of features and we have to make an intelligent decision regarding the type of feature that we want to select to go ahead with our machine learning endeavor.
  • Algorithm
    This is a vital step since the algorithms that we choose will have a very major impact on the entire process of machine learning. You can choose between the linear and nonlinear algorithm. Some of the algorithms used are Support Vector Machines, Decision Trees, Naïve Bayes, K-Means Clustering, etc.
  • Training
    This is the most important part of the machine learning technique, and this is where it differs from traditional programming. The training is done based on the data that we have, providing more real-world experience. With each consequent training step the machine gets better and smarter and is able to take improved decisions.
  • Evaluation
    In this step we actually evaluate the decisions taken by the machine in order to decide whether they are up to the mark or not. There are various metrics involved in this process, and we have to closely examine each of these to decide on the efficacy of the whole machine learning endeavor.
  • Optimization
    This process involves improving the performance of the machine learning process using various optimization techniques. Optimization of machine learning is one of the most vital components wherein the performance of the algorithm is vastly improved. The best part of optimization techniques is that machine learning is not just a consumer of optimization techniques but it also provides new ideas for optimization too.
  • Testing
    Here various tests are carried out, and some of these use unseen test cases. The data is partitioned into test and training sets. There are various testing techniques, like cross-validation, in order to deal with multiple situations.

15. What do you understand by the term Normal Distribution?

It is a set of continuous variables spread across a normal curve, or in the shape of a bell curve. It can be considered a continuous probability distribution and is useful in statistics. It is the most common distribution curve, and it becomes very useful to analyze the variables and their relationships when we have the normal distribution curve.
The normal distribution curve is symmetrical. A non-normal distribution approaches the normal distribution as the sample size increases, which is what makes the Central Limit Theorem easy to apply. This helps to make sense of random data by imposing an order and interpreting the results using a bell-shaped graph.
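For reference, the probability density of the normal distribution with mean \mu and standard deviation \sigma is:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}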

16. What is Linear Regression?

It is the most commonly used method for predictive analytics. The linear regression method is used to describe the relationship between a dependent variable and one or more independent variables. The main task in linear regression is fitting a single line through a scatter plot. Linear regression involves the following three steps:

  • Determining and analyzing the correlation and direction of the data
  • Deploying the estimation of the model
  • Ensuring the usefulness and validity of the model
    It is extensively used in scenarios where a cause-and-effect model comes into play, for example when you want to know the effect of a certain action in order to determine the various outcomes and the extent to which the cause determines the final outcome.
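As a minimal worked form (standard least-squares notation, not taken from the material above), the fitted line and its coefficient estimates are:

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x, \qquad \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}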

17. What is Interpolation and Extrapolation?

The terms of interpolation and extrapolation are extremely important in any statistical analysis. Extrapolation is the determination or estimation using a known set of values or facts by extending it and taking it to an area or region that is unknown. It is the technique of inferring something using data that is available.
Interpolation on the other hand is the method of determining a certain value which falls between a certain set of values or the sequence of values. This is especially useful when you have data at the two extremities of a certain region but you don’t have enough data points at the specific point. This is when you deploy interpolation to determine the value that you need.

18. What is Power Analysis?

The power analysis is a vital part of the experimental design. It is involved with the process of determining the sample size needed for detecting an effect of a given size from a cause with a certain degree of assurance. It lets you deploy specific probability in a sample size constraint.
The various techniques of statistical power analysis and sample size estimation are widely deployed for making statistical judgment that are accurate and evaluate the size needed for experimental effects in practice.
Power analysis lets you estimate the sample size so that it is neither too high nor too low. With too small a sample there is no basis for reliable answers, and with too large a sample there is a wastage of resources.

19. What is K-means? How can you select K for K-means?

K-means clustering can be termed the basic unsupervised learning algorithm. It is a method of classifying data into a certain number of clusters, called K clusters. It is deployed for grouping data in order to find similarity in the data.
It includes defining the K centers, one in each cluster. The clusters are defined into K groups, with K being predefined. The K points are selected at random as cluster centers. The objects are assigned to their nearest cluster center. The objects within a cluster are as closely related to one another as possible and differ as much as possible from the objects in other clusters. K-means clustering works very well for large sets of data. To select K, a common approach is the elbow method: run the algorithm for a range of K values and pick the K at which the within-cluster sum of squares stops decreasing sharply.
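As a minimal sketch of the idea (Lloyd's algorithm on one-dimensional toy data, with K assumed to be chosen in advance, e.g. via the elbow method mentioned above; all data values are illustrative):

import java.util.Arrays;
import java.util.Random;

public class KMeans1D {
    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 9.0, 8.8};
        int k = 3;
        Random rnd = new Random(42);

        // initialise the K centers with randomly chosen points
        double[] centers = new double[k];
        for (int i = 0; i < k; i++) {
            centers[i] = points[rnd.nextInt(points.length)];
        }

        int[] assignment = new int[points.length];
        for (int iter = 0; iter < 100; iter++) {
            // assignment step: attach each point to its nearest center
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(points[p] - centers[c]) < Math.abs(points[p] - centers[best])) {
                        best = c;
                    }
                }
                assignment[p] = best;
            }
            // update step: move each center to the mean of its assigned points
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int p = 0; p < points.length; p++) {
                sum[assignment[p]] += points[p];
                count[assignment[p]]++;
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) {
                    centers[c] = sum[c] / count[c];
                }
            }
        }
        System.out.println("Cluster centers: " + Arrays.toString(centers));
    }
}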

20. How is Data modeling different from Database design?

Data Modeling: It can be considered as the first step towards the design of a database. Data modeling creates a conceptual model based on the relationship between various data models. The process involves moving from the conceptual stage to the logical model to the physical schema. It involves the systematic method of applying the data modeling techniques.
Database Design: This is the process of designing the database. The database design creates an output which is a detailed data model of the database. Strictly speaking database design includes the detailed logical model of a database but it can also include physical design choices and storage parameters.

Tell us how big data and Hadoop are related to each other.

Answer: Big data and Hadoop are almost synonymous terms. With the rise of big data, Hadoop, a framework that specializes in big data operations, also became popular. The framework can be used by professionals to analyze big data and help businesses make decisions.

Note: This question is commonly asked in a big data interview. You can go further to answer this question and try to explain the main components of Hadoop.

4. How is big data analysis helpful in increasing business revenue?

Answer: Big data analysis has become very important for businesses. It helps businesses to differentiate themselves from others and increase their revenue. Through predictive analytics, big data analytics provides businesses with customized recommendations and suggestions. Also, big data analytics enables businesses to launch new products depending on customer needs and preferences. These factors help businesses earn more revenue, and thus companies are using big data analytics. Companies may see a significant increase of 5-20% in revenue by implementing big data analytics. Some popular companies that use big data analytics to increase their revenue are Walmart, LinkedIn, Facebook, Twitter, Bank of America, etc.

5. Explain the steps to be followed to deploy a Big Data solution.

Answer: The following are the three steps followed to deploy a Big Data solution –

i. Data Ingestion

The first step for deploying a big data solution is the data ingestion i.e. extraction of data from various sources. The data source may be a CRM like Salesforce, Enterprise Resource Planning System like SAP, RDBMS like MySQL or any other log files, documents, social media feeds etc. The data can be ingested either through batch jobs or real-time streaming. The extracted data is then stored in HDFS.

Why is Hadoop used for Big Data Analytics?

Answer: Since data analysis has become one of the key parameters of business, enterprises are dealing with massive amounts of structured, unstructured and semi-structured data. Analyzing unstructured data is quite difficult, and this is where Hadoop plays a major part with its capabilities of

  • Storage
  • Processing
  • Data collection

Moreover, Hadoop is open source and runs on commodity hardware. Hence it is a cost-effective solution for businesses.

8. What is fsck?

Answer: fsck stands for File System Check. It is a command used by HDFS to check for inconsistencies and problems in files. For example, if there are any missing blocks for a file, HDFS gets notified through this command.

9. What are the main differences between NAS (Network-attached storage) and HDFS?

Answer: The main differences between NAS (Network-attached storage) and HDFS –

  • HDFS runs on a cluster of machines while NAS runs on an individual machine. Since HDFS replicates data blocks across machines, data redundancy is inherent to HDFS; NAS uses a different replication protocol, so the chances of data redundancy are much less.
  • Data is stored as data blocks in local drives in case of HDFS. In case of NAS, it is stored in dedicated hardware.

10. What is the Command to format the NameNode?

Answer: $ hdfs namenode -format

Big data is not just what you think; it's a broad spectrum, and there are a number of career options in the Big Data world.

Experience-based Big Data Interview Questions

If you have considerable experience of working in the Big Data world, you will be asked a number of questions in your big data interview based on your previous experience. These questions may be simply related to your experience or scenario based. So, get prepared with these best Big Data interview questions and answers –

11. Do you have any Big Data experience? If so, please share it with us.

How to Approach: There is no specific answer to the question as it is a subjective question and the answer depends on your previous experience. Asking this question during a big data interview, the interviewer wants to understand your previous experience and is also trying to evaluate if you are fit for the project requirement.

So, how will you approach the question? If you have previous experience, start with your duties in your past position and slowly add details to the conversation. Tell them about your contributions that made the project successful. This question is generally the 2nd or 3rd question asked in an interview. The later questions are based on this one, so answer it carefully. You should also take care not to go overboard with a single aspect of your previous job. Keep it simple and to the point.

12. Do you prefer good data or good models? Why?

How to Approach: This is a tricky question but generally asked in the big data interview. It asks you to choose between good data and good models. As a candidate, you should try to answer it from your experience. Many companies want to follow a strict process of evaluating data, meaning they have already selected their data models. In this case, having good data can be game-changing. The other way around also works, as a model is chosen based on good data.

As we already mentioned, answer it from your experience. However, don’t say that having both good data and good models is important as it is hard to have both in real life projects.

13. Will you optimize algorithms or code to make them run faster?

How to Approach: The answer to this question should always be “Yes.” Real world performance matters and it doesn’t depend on the data or model you are using in your project.

The interviewer might also be interested to know if you have had any previous experience in code or algorithm optimization. For a beginner, it obviously depends on which projects he worked on in the past. Experienced candidates can share their experience accordingly as well. However, be honest about your work, and it is fine if you haven’t optimized code in the past. Just let the interviewer know your real experience and you will be able to crack the big data interview.

14. How do you approach data preparation?

How to Approach: Data preparation is one of the crucial steps in big data projects. A big data interview may involve at least one question based on data preparation. When the interviewer asks you this question, he wants to know what steps or precautions you take during data preparation.

As you already know, data preparation is required to get necessary data which can then further be used for modeling purposes. You should convey this message to the interviewer. You should also emphasize the type of model you are going to use and reasons behind choosing that particular model. Last, but not the least, you should also discuss important data preparation terms such as transforming variables, outlier values, unstructured data, identifying gaps, and others.

15. How would you transform unstructured data into structured data?

How to Approach: Unstructured data is very common in big data. The unstructured data should be transformed into structured data to ensure proper data analysis. You can start answering the question by briefly differentiating between the two. Once done, you can discuss the methods you use to transform one form into another. You might also share a real-world situation where you did it. If you have recently graduated, you can share information related to your academic projects.

By answering this question correctly, you are signaling that you understand the types of data, both structured and unstructured, and also have the practical experience to work with these. If you give an answer to this question specifically, you will definitely be able to crack the big data interview.

16. Which hardware configuration is most beneficial for Hadoop jobs?

Dual-processor or dual-core machines with a configuration of 4 / 8 GB RAM and ECC memory are ideal for running Hadoop operations. However, the hardware configuration varies based on the project-specific workflow and process flow and needs customization accordingly.

17. What happens when two users try to access the same file in the HDFS?

HDFS NameNode supports exclusive write only. Hence, only the first user will receive the grant for file access and the second user will be rejected.

18. How to recover a NameNode when it is down?

The following steps need to be executed to make the Hadoop cluster up and running:

1. Use the FsImage which is file system metadata replica to start a new NameNode.

2. Configure the DataNodes and also the clients to make them acknowledge the newly started NameNode.

3. Once the new NameNode has completed loading the last checkpoint FsImage and has received enough block reports from the DataNodes, it will start to serve the client.

In the case of large Hadoop clusters, the NameNode recovery process consumes a lot of time, which turns out to be a significant challenge during routine maintenance.

19. What do you understand by Rack Awareness in Hadoop?

It is an algorithm applied to the NameNode to decide how blocks and their replicas are placed. Depending on rack definitions, network traffic between DataNodes is minimized by preferring nodes within the same rack. For example, if we consider a replication factor of 3, two copies will be placed on one rack and the third copy on a separate rack.

20. What is the difference between “HDFS Block” and “Input Split”?

The HDFS divides the input data physically into blocks for processing which is known as HDFS Block.

Input Split is a logical division of data by mapper for mapping operation.

22. What are the common input formats in Hadoop?

Answer: Below are the common input formats in Hadoop –

  • Text Input Format – The default input format defined in Hadoop is the Text Input Format.
  • Sequence File Input Format – To read files in a sequence, Sequence File Input Format is used.
  • Key Value Input Format – The input format used for plain text files (files broken into lines) is the Key Value Input Format.

23. Explain some important features of Hadoop.

Answer: Hadoop supports the storage and processing of big data. It is the best solution for handling big data challenges. Some important features of Hadoop are –

  • Open Source – Hadoop is an open source framework which means it is available free of cost. Also, the users are allowed to change the source code as per their requirements.
  • Distributed Processing – Hadoop supports distributed processing of data i.e. faster processing. The data in Hadoop HDFS is stored in a distributed manner and MapReduce is responsible for the parallel processing of data.
  • Fault Tolerance – Hadoop is highly fault-tolerant. It creates three replicas for each block at different nodes, by default. This number can be changed according to the requirement. So, we can recover the data from another node if one node fails. The detection of node failure and recovery of data is done automatically.
  • Reliability – Hadoop stores data on the cluster in a reliable manner that is independent of machine. So, the data stored in Hadoop environment is not affected by the failure of the machine.
  • Scalability – Another important feature of Hadoop is scalability. It is compatible with other hardware and we can easily add new hardware to the nodes.
  • High Availability – The data stored in Hadoop is available to access even after the hardware failure. In case of hardware failure, the data can be accessed from another path.

24. Explain the different modes in which Hadoop run.

Answer: Apache Hadoop runs in the following three modes –

  • Standalone (Local) Mode – By default, Hadoop runs in a local mode i.e. on a non-distributed, single node. This mode uses the local file system to perform input and output operation. This mode does not support the use of HDFS, so it is used for debugging. No custom configuration is needed for configuration files in this mode.
  • Pseudo-Distributed Mode – In the pseudo-distributed mode, Hadoop runs on a single node just like the Standalone mode. In this mode, each daemon runs in a separate Java process. As all the daemons run on a single node, there is the same node for both the Master and Slave nodes.
  • Fully-Distributed Mode – In the fully-distributed mode, all the daemons run on separate individual nodes and thus form a multi-node cluster. There are different nodes for Master and Slave nodes.

25. Explain the core components of Hadoop.

Answer: Hadoop is an open source framework that is meant for storage and processing of big data in a distributed manner. The core components of Hadoop are –

  • HDFS (Hadoop Distributed File System) – HDFS is the basic storage system of Hadoop. The large data files running on a cluster of commodity hardware are stored in HDFS. It can store data in a reliable manner even when hardware fails.
  • MapReduce – The processing layer of Hadoop; a programming model that processes large data sets in parallel across the cluster.
  • YARN – The resource management layer of Hadoop; it allocates resources and handles requests from distributed applications.

What are the configuration parameters in a “MapReduce” program?

The main configuration parameters in “MapReduce” framework are:

  • Input locations of Jobs in the distributed file system
  • Output location of Jobs in the distributed file system
  • The input format of data
  • The output format of data
  • The class which contains the map function
  • The class which contains the reduce function
  • JAR file which contains the mapper, reducer and the driver classes

27. What is a block in HDFS and what is its default size in Hadoop 1 and Hadoop 2? Can we change the block size?

Blocks are the smallest continuous units of data storage on a hard drive. In HDFS, blocks are distributed across the Hadoop cluster.

  • The default block size in Hadoop 1 is: 64 MB
  • The default block size in Hadoop 2 is: 128 MB

Yes, we can change the block size by using the dfs.block.size parameter located in the hdfs-site.xml file.
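A minimal sketch of overriding the block size from client code for the files a particular job writes (the 256 MB value and class name are illustrative; a cluster-wide default would normally go in hdfs-site.xml as noted above):

import org.apache.hadoop.conf.Configuration;

public class BlockSizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.blocksize is the Hadoop 2.x property name; older releases used dfs.block.size
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
        // files created through a FileSystem obtained from this conf will use 256 MB blocks
    }
}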

28. What is Distributed Cache in a MapReduce Framework

Distributed Cache is a feature of the Hadoop MapReduce framework used to cache files for applications. The Hadoop framework makes cached files available to every map/reduce task running on the data nodes. Hence, the tasks can access the cached file as a local file within the designated job.

29. What are the three running modes of Hadoop?

The three running modes of Hadoop are as follows:

i. Standalone or local : This is the default mode and does not need any configuration. In this mode, all the following components of Hadoop use the local file system and run in a single JVM –

  • NameNode
  • DataNode
  • ResourceManager
  • NodeManager

ii. Pseudo-distributed : In this mode, all the master and slave Hadoop services are deployed and executed on a single node.

iii. Fully distributed : In this mode, Hadoop master and slave services are deployed and executed on separate nodes.

30. Explain JobTracker in Hadoop

JobTracker is a JVM process in Hadoop to submit and track MapReduce jobs.

JobTracker performs the following activities in Hadoop in a sequence –

  • The JobTracker receives jobs submitted by client applications
  • The JobTracker communicates with the NameNode to determine the data location
  • The JobTracker allocates TaskTracker nodes based on available slots
  • It submits the work to the allocated TaskTracker nodes
  • The JobTracker monitors the TaskTracker nodes
  • When a task fails, the JobTracker is notified and decides how to reallocate the task

31. What are the different configuration files in Hadoop?

Answer: The different configuration files in Hadoop are –

core-site.xml – This configuration file contains Hadoop core configuration settings, for example, I/O settings, common to MapReduce and HDFS. It specifies the hostname and port.

mapred-site.xml – This configuration file specifies a framework name for MapReduce by setting mapreduce.framework.name

hdfs-site.xml – This configuration file contains HDFS daemons configuration settings. It also specifies default block permission and replication checking on HDFS.

yarn-site.xml – This configuration file specifies configuration settings for ResourceManager and NodeManager.

How can you achieve security in Hadoop?

Answer: Kerberos is used to achieve security in Hadoop. At a high level, there are 3 steps to access a service while using Kerberos. Each step involves a message exchange with a server.

1. Authentication – The first step involves authentication of the client to the authentication server, which then provides a time-stamped TGT (Ticket-Granting Ticket) to the client.

2. Authorization – In this step, the client uses received TGT to request a service ticket from the TGS (Ticket Granting Server).

3. Service Request – The final step to achieve security in Hadoop. The client uses the service ticket to authenticate itself to the server.

34. What is commodity hardware?

Answer: Commodity hardware is low-cost, readily available hardware that meets only minimal requirements rather than being specialized or high-end. Commodity hardware must include enough RAM because it runs a number of services that require RAM for execution. One doesn't require high-end hardware configurations or supercomputers to run Hadoop; it can be run on any commodity hardware.

How do Hadoop MapReduce works?

There are two phases of MapReduce operation.

  • Map phase – In this phase, the input data is divided into splits and processed by map tasks running in parallel. The split data is used for analysis.
  • Reduce phase – In this phase, the intermediate outputs of the map tasks are aggregated across the entire collection and the result is produced.

37. What is MapReduce? What is the syntax you use to run a MapReduce program?

MapReduce is a programming model in Hadoop for processing large data sets over a cluster of computers, with the data commonly stored in HDFS. It is a parallel programming model.

The syntax to run a MapReduce program is – hadoop jar <jar_file_name> <class_name> /input_path /output_path.

38. What are the Port Numbers for NameNode, Task Tracker, and Job Tracker?

  • NameNode – Port 50070
  • Task Tracker – Port 50060
  • Job Tracker – Port 50030

39.What are the different file permissions in HDFS for files or directory levels?

Hadoop distributed file system (HDFS) uses a specific permissions model for files and directories. Following user levels are used in HDFS –

  • Owner
  • Group
  • Others.

For each of the user mentioned above following permissions are applicable –

  • read (r)
  • write (w)
  • execute(x).

Above mentioned permissions work differently for files and directories.

For files –

  • The r permission is for reading a file
  • The w permission is for writing a file.

For directories –

  • The r permission lists the contents of a specific directory.
  • The w permission creates or deletes a directory.
  • The x permission is for accessing a child directory.

How to restart all the daemons in Hadoop?

Answer: To restart all the daemons, it is required to stop all the daemons first. The Hadoop directory contains sbin directory that stores the script files to stop and start daemons in Hadoop.

Use the stop-daemons command ./sbin/stop-all.sh to stop all the daemons and then use the ./sbin/start-all.sh command to start all the daemons again.

42. What is the use of jps command in Hadoop?

Answer: The jps command is used to check if the Hadoop daemons are running properly or not. This command shows all the daemons running on a machine i.e. Datanode, Namenode, NodeManager, ResourceManager etc.

43. Explain the process that overwrites the replication factors in HDFS.

Answer: There are two methods to overwrite the replication factors in HDFS –

Method 1: On File Basis

In this method, the replication factor is changed on the basis of file using Hadoop FS shell. The command used for this is:

$ hadoop fs -setrep -w 2 /my/test_file

Here, test_file is the file whose replication factor will be set to 2.

Method 2: On Directory Basis

In this method, the replication factor is changed on directory basis i.e. the replication factor for all the files under a given directory is modified.

$ hadoop fs -setrep -w 5 /my/test_dir

Here, test_dir is the name of the directory; the replication factor for the directory and all the files in it will be set to 5.

44. What will happen with a NameNode that doesn’t have any data?

Answer: A NameNode without any data doesn’t exist in Hadoop. If there is a NameNode, it will contain some data in it or it won’t exist.

45. Explain NameNode recovery process.

Answer: The NameNode recovery process involves the below-mentioned steps to make Hadoop cluster running:

  • In the first step of the recovery process, the file system metadata replica (FsImage) is used to start a new NameNode.
  • The next step is to configure DataNodes and Clients. These DataNodes and Clients will then acknowledge new NameNode.
  • During the final step, the new NameNode starts serving clients once it has completed loading the last checkpoint FsImage and has received enough block reports from the DataNodes.

Note: Don't forget to mention that this NameNode recovery process consumes a lot of time on large Hadoop clusters, which makes routine maintenance difficult. For this reason, the HDFS high availability architecture is recommended.

46. How Is Hadoop CLASSPATH essential to start or stop Hadoop daemons?

CLASSPATH includes necessary directories that contain jar files to start or stop Hadoop daemons. Hence, setting CLASSPATH is essential to start or stop Hadoop daemons.

However, setting up CLASSPATH every time is not the standard that we follow. Usually CLASSPATH is written inside /etc/hadoop/hadoop-env.sh file. Hence, once we run Hadoop, it will load the CLASSPATH automatically.

47. Why is HDFS only suitable for large data sets and not the correct tool to use for many small files?

This is due to a performance limitation of the NameNode. The NameNode keeps the metadata for every file in memory, and that metadata overhead is roughly the same whether the file is large or small. With many small files, the NameNode's memory fills up with metadata while storage remains underutilized, which becomes a performance issue.

48. Why do we need Data Locality in Hadoop? Explain.

Datasets in HDFS are stored as blocks on the DataNodes of the Hadoop cluster. During the execution of a MapReduce job, the individual Mapper processes the blocks (input splits). If the data does not reside on the same node where the Mapper is executing, it has to be copied over the network from the DataNode that holds it to the Mapper's node. Data locality avoids this network copy by scheduling the Mapper on the node that already holds the data, saving bandwidth and speeding up the job.

DFS can handle a large volume of data then why do we need Hadoop framework?

Hadoop not only stores large data but also processes that big data. Though a DFS (Distributed File System) can also store the data, it lacks the features below –

  • It is not fault tolerant
  • Data movement over a network depends on bandwidth.

50. What is Sequencefileinputformat?

Hadoop uses a specific file format known as a Sequence file. The sequence file stores data as serialized key-value pairs. SequenceFileInputFormat is an input format for reading sequence files.
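A minimal sketch of configuring a job to read sequence files (the job name and input path are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class SequenceInputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "read-sequence-files");
        // tell the job that its input files are SequenceFiles rather than plain text
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/sequence"));   // hypothetical input
        // ... set mapper, reducer and output path as usual, then job.waitForCompletion(true)
    }
}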

1 – Define Big Data And Explain The Five Vs of Big Data.

One of the most introductory Big Data questions asked during interviews, the answer to this is fairly straightforward-

Big Data is defined as a collection of large and complex unstructured data sets from where insights are derived from Data Analysis using open-source tools like Hadoop.

The five Vs of Big Data are –

  • Volume – Amount of data in Petabytes and Exabytes
  • Variety – Includes formats like videos, audio sources, textual data, etc.
  • Velocity – Everyday data growth which includes conversations in forums, blogs, social media posts, etc.
  • Veracity – Degree of accuracy of data available
  • Value – Deriving insights from collected data to achieve business milestones and new heights

2- How is Hadoop related to Big Data? Describe its components.

Another fairly simple question. Apache Hadoop is an open-source framework used for storing, processing, and analyzing complex unstructured data sets for deriving insights and actionable intelligence for businesses.

The three main components of Hadoop are-

  • MapReduce – A programming model which processes large datasets in parallel
  • HDFS – A Java-based distributed file system used for data storage without prior organization
  • YARN – A framework that manages resources and handles requests from distributed applications

3- Define HDFS and YARN, and talk about their respective components.

The Hadoop Distributed File System (HDFS) is the storage unit that is responsible for storing different types of data blocks in a distributed environment.

The two main components of HDFS are-

  • NameNode – A master node that processes metadata information for data blocks contained in the HDFS
  • DataNode – Nodes which act as slave nodes and simply store the data, for use and processing by the NameNode

The Yet Another Resource Negotiator (YARN) is the processing component of Apache Hadoop and is responsible for managing resources and providing an execution environment for said processes.

The two main components of YARN are-

  • ResourceManager – Receives processing requests and allocates its parts to respective NodeManagers based on processing needs.
  • NodeManager – Executes tasks on every single Data Node

4 – Explain the term ‘Commodity Hardware.’

Commodity Hardware refers to the minimal hardware resources and components, collectively needed, to run the Apache Hadoop framework and related data management tools. Apache Hadoop does not need high-end servers to execute tasks, and any hardware that supports its minimum requirements is known as 'Commodity Hardware.'

5 – Define and describe the term FSCK.

FSCK (File System Check) is a command used to run a Hadoop summary report that describes the state of the Hadoop file system. This command is used to check the health of the file distribution system when one or more file blocks become corrupt or unavailable. Unlike the traditional fsck utility, HDFS FSCK only checks for errors in the system and does not correct them. The command can be run on the whole system or on a subset of files.

The command for FSCK is bin/hdfs fsck <path>.

6 – What is the purpose of the JPS command in Hadoop?

The jps command is used to test whether all Hadoop daemons are running correctly or not. It specifically checks daemons in Hadoop like the NameNode, DataNode, ResourceManager, NodeManager, and others.

7 – Name the different commands for starting up and shutting down Hadoop Daemons.

To start up all the Hadoop Daemons together-

./sbin/start-all.sh

To shut down all the Hadoop Daemons together-

./sbin/stop-all.sh

To start up all the daemons related to DFS, YARN, and MR Job History Server, respectively-

./sbin/start-dfs.sh

./sbin/start-yarn.sh

./sbin/mr-jobhistory-daemon.sh start historyserver

To stop the DFS, YARN, and MR Job History Server daemons, respectively-

./sbin/stop-dfs.sh
./sbin/stop-yarn.sh
./sbin/mr-jobhistory-daemon.sh stop historyserver

The final way is to start up and stop all the Hadoop Daemons individually –

./sbin/hadoop-daemon.sh start namenode
./sbin/hadoop-daemon.sh start datanode
./sbin/yarn-daemon.sh start resourcemanager
./sbin/yarn-daemon.sh start nodemanager
./sbin/mr-jobhistory-daemon.sh start historyserver

8 – Why do we need Hadoop for Big Data Analytics?

In most cases, exploring and analyzing large unstructured data sets becomes difficult with the lack of analysis tools. This is where Hadoop comes in as it offers storage, processing, and data collection capabilities. Hadoop stores data in its raw forms without the use of any schema and allows the addition of any number of nodes.

Since Hadoop is open-source and is run on commodity hardware, it is also economically feasible for businesses and organizations to use it for the purpose of Big Data Analytics.

9 – Explain the different features of Hadoop.

Listed in many Big Data Interview Questions and Answers, the answer to this is-

  • Open-Source – Open-source frameworks include source code that is available and accessible to all over the World Wide Web. These code snippets can be rewritten, edited, and modified according to user and analytics requirements.
  • Scalability – Although Hadoop runs on commodity hardware, additional hardware resources can be added to new nodes.
  • Data Recovery – Hadoop allows the recovery of data by replicating each block into three replicas across the cluster. Users can recover data from another node in case of failure, and Hadoop detects failed tasks/nodes and recovers them automatically during such instances.
  • User-Friendly – For users who are new to Data Analytics, Hadoop is the perfect framework to use as its user interface is simple and there is no need for clients to handle distributed computing processes as the framework takes care of it.
  • Data Locality – Hadoop features Data Locality which moves computation to data instead of data to computation. Data is moved to clusters rather than bringing them to the location where MapReduce algorithms are processed and submitted.

10 – Define the Port Numbers for NameNode, Task Tracker and Job Tracker.

NameNode – Port 50070

Task Tracker – Port 50060

Job Tracker – Port 50030

11 – How does HDFS Index Data blocks? Explain.

HDFS indexes data blocks based on their respective sizes. The end of a data block points to the address of where the next chunk of data blocks gets stored. The DataNodes store the blocks of data while the NameNode manages these data blocks by using an in-memory image of all the files of said data blocks. Clients receive information related to data blocks from the NameNode.

12 – What are Edge Nodes in Hadoop?

Edge nodes are gateway nodes in Hadoop which act as the interface between the Hadoop cluster and the external network. They run client applications and cluster administration tools and are used as staging areas for data transfers to the Hadoop cluster. Enterprise-class storage capabilities (such as 900GB SAS drives with RAID HDD controllers) are required for edge nodes, and a single edge node usually suffices for multiple Hadoop clusters.

13 – What are some of the data management tools used with Edge Nodes in Hadoop?

Oozie, Ambari, Hue, Pig, and Flume are the most common data management tools that work with edge nodes in Hadoop. Other similar tools include HCatalog, BigTop, and Avro.

14 – Explain the core methods of a Reducer.

There are three core methods of a reducer, listed below; a minimal sketch follows the list.

1. setup() – Called once at the start of the task; configures parameters such as the distributed cache, heap size, and input data.

2. reduce() – Called once per key with the list of values associated with that key; this is where the actual reduce logic runs.

3. cleanup() – Called once at the end of the reducer task; clears temporary files and releases resources.
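
A minimal sketch of these three methods using the org.apache.hadoop.mapreduce API (the class name SumReducer and the summing logic are illustrative, not taken from the source):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the values for each key; setup() and cleanup() each run once per reduce task.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Called once before any reduce() call: read configuration, open side resources, etc.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with all of that key's values.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called once at the end of the task: close resources, delete temporary files, etc.
    }
}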

15 – Talk about the different tombstone markers used for deletion purposes in HBase.

There are three main tombstone markers used for deletion in HBase. They are-

1. Family Delete Marker – Marks all the columns of a column family

2. Version Delete Marker – Marks a single version of a single column

3. Column Delete Marker– Marks all the versions of a single column

1) What is Hadoop Map Reduce ?

The Hadoop MapReduce framework is used for processing large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step map and reduce process.

2) How Hadoop MapReduce works?

Taking word count as an example: during the map phase, each map task counts the words in its portion of the documents, while during the reduce phase the counts are aggregated across the entire collection. In general, during the map phase the input data is divided into splits that are analyzed by map tasks running in parallel across the Hadoop cluster.

3) Explain what is shuffling in MapReduce ?

The process by which the system sorts the map outputs and transfers them to the reducers as input is known as the shuffle.

4) Explain what is distributed Cache in MapReduce Framework ?

Distributed Cache is an important feature provided by the MapReduce framework. It is used when you want to share files across all nodes in a Hadoop cluster. The files could be executable JAR files or simple properties files.
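
A hedged sketch of how a file might be registered in the distributed cache with the newer org.apache.hadoop.mapreduce API (the job name and HDFS path are made up for illustration):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Driver-side sketch: register a side file so the framework ships it to every task node.
public class CacheFileDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "cache-example");   // job name is illustrative
        // Hypothetical HDFS path; the "#countries" suffix creates a local symlink alias on each node.
        job.addCacheFile(new URI("/apps/lookup/countries.properties#countries"));
        // Inside a Mapper/Reducer setup(), the cached files can be listed with
        // context.getCacheFiles() and opened locally through the "countries" alias.
    }
}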

5) Explain what is NameNode in Hadoop?

NameNode in Hadoop is the node, where Hadoop stores all the file location information in HDFS (Hadoop Distributed File System). In other words, NameNode is the centrepiece of an HDFS file system. It keeps the record of all the files in the file system, and tracks the file data across the cluster or multiple machines

6) Explain what is JobTracker in Hadoop? What are the actions followed by Hadoop?

In Hadoop, the JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process.

The JobTracker performs the following actions in Hadoop:

  • Client applications submit jobs to the JobTracker
  • The JobTracker communicates with the NameNode to determine the data location
  • The JobTracker locates TaskTracker nodes with available slots at or near the data
  • It submits the work to the chosen TaskTracker nodes
  • When a task fails, the JobTracker is notified and decides how to proceed
  • The JobTracker monitors the TaskTracker nodes

7) Explain what is heartbeat in HDFS?

A heartbeat is a signal sent between a DataNode and the NameNode, and between a TaskTracker and the JobTracker. If the NameNode or JobTracker does not receive the signal, it is assumed that there is an issue with the DataNode or TaskTracker.

8) Explain what combiners is and when you should use a combiner in a MapReduce Job?

Combiners are used to increase the efficiency of a MapReduce program. A combiner reduces the amount of data that needs to be transferred across the network to the reducers. If the operation performed is commutative and associative, you can use your reducer code as a combiner. Note that the execution of the combiner is not guaranteed in Hadoop.
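
Assuming a commutative, associative reduce such as the SumReducer sketched in question 14 above, registering it as the combiner is a one-line job setting (class names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Reusing the reducer as a combiner is only safe for commutative, associative operations (e.g. sums).
public class CombinerConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sum-with-combiner");
        job.setCombinerClass(SumReducer.class);   // local, per-map-output reduce (SumReducer from the sketch above)
        job.setReducerClass(SumReducer.class);    // same class performs the final, cluster-wide reduce
    }
}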

9) What happens when a datanode fails ?

When a DataNode fails:

  • The JobTracker and NameNode detect the failure
  • All tasks scheduled on the failed node are re-scheduled
  • The NameNode replicates the user's data to another node

10) Explain what is Speculative Execution?

In Hadoop, Speculative Execution launches a certain number of duplicate tasks. Multiple copies of the same map or reduce task can be executed on different slave nodes. In simple words, if a particular node is taking a long time to complete a task, Hadoop creates a duplicate of that task on another node; the copy that finishes first is accepted and the other copies are killed.

11) Explain what are the basic parameters of a Mapper?

The basic parameters of a Mapper are its input and output key/value types; for the classic word-count mapper these are (a sketch follows the list):

  • LongWritable and Text (input key and value)
  • Text and IntWritable (output key and value)
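
A minimal word-count mapper illustrating these types (class and variable names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: (LongWritable byte offset, Text line). Output: (Text word, IntWritable 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every token in the line
        }
    }
}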

12) Explain what is the function of MapReducer partitioner?

The MapReduce Partitioner makes sure that all the values of a single key go to the same reducer, which helps distribute the map output evenly over the reducers.

13) Explain what is difference between an Input Split and HDFS Block?

Logical division of data is known as Split while physical division of data is known as HDFS Block

14) Explain what happens in TextInputFormat?

In TextInputFormat, each line of the text file is a record. The value is the content of the line, while the key is the byte offset of the line. For instance, key: LongWritable, value: Text.

15) Mention what are the main configuration parameters that user need to specify to run Mapreduce Job ?

The user of the MapReduce framework needs to specify the following (a driver sketch follows the list):

  • Job’s input locations in the distributed file system
  • Job’s output location in the distributed file system
  • Input format
  • Output format
  • Class containing the map function
  • Class containing the reduce function
  • JAR file containing the mapper, reducer and driver classes
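
A hedged driver sketch that wires these parameters together with the org.apache.hadoop.mapreduce API; the paths come from the command line, and WordCountMapper/SumReducer refer to the sketches shown earlier rather than to anything in the source:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCountDriver.class);               // JAR containing mapper, reducer and driver

        FileInputFormat.addInputPath(job, new Path(args[0]));   // job input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // job output location in HDFS

        job.setInputFormatClass(TextInputFormat.class);         // input format
        job.setOutputFormatClass(TextOutputFormat.class);       // output format

        job.setMapperClass(WordCountMapper.class);              // class containing the map function
        job.setReducerClass(SumReducer.class);                  // class containing the reduce function

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}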

16) Explain what is WebDAV in Hadoop?

WebDAV is a set of extensions to HTTP that supports editing and updating files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.

17) Explain what is sqoop in Hadoop ?

Sqoop is a tool used to transfer data between relational database management systems (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as MySQL or Oracle into HDFS, and data can also be exported from HDFS back to an RDBMS.

18) Explain how JobTracker schedules a task ?

The TaskTrackers periodically send heartbeat messages to the JobTracker to confirm that they are active and functioning. The messages also inform the JobTracker about the number of available slots, so the JobTracker can stay up to date on where in the cluster work can be delegated.

19) Explain what is Sequencefileinputformat?

SequenceFileInputFormat is used for reading sequence files. It is a specific compressed binary file format which is optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.

20) Explain what does the conf.setMapper Class do ?

JobConf.setMapperClass() sets the mapper class for the job, along with everything related to the map task, such as reading the data and generating key-value pairs out of the mapper.

21) Explain what is Hadoop?

It is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides enormous processing power and massive storage for any type of data.

22) Mention what is the difference between an RDBMS and Hadoop?

RDBMS vs. Hadoop:

  • RDBMS is a relational database management system; Hadoop is a node-based, flat-structured framework.
  • RDBMS is used for OLTP processing; Hadoop is currently used for analytical and Big Data processing.
  • In an RDBMS, the database cluster uses the same data files stored in shared storage; in Hadoop, data can be stored independently on each processing node.
  • In an RDBMS, you need to preprocess data before storing it; in Hadoop, you don't need to preprocess data before storing it.

23) Mention Hadoop core components?

Hadoop core components include,

  • HDFS
  • MapReduce

24) What is NameNode in Hadoop?

The NameNode in Hadoop is where Hadoop stores all the file location information in HDFS. It is the master node on which the JobTracker runs, and it holds the metadata.

25) Mention what are the data components used by Hadoop?

Data components used by Hadoop are

  • Pig
  • Hive

26) Mention what is the data storage component used by Hadoop?

The data storage component used by Hadoop is HBase.

27) Mention what are the most common input formats defined in Hadoop?

The most common input formats defined in Hadoop are;

  • TextInputFormat
  • KeyValueInputFormat
  • SequenceFileInputFormat

28) In Hadoop what is InputSplit?

InputSplit splits input files into chunks and assigns each split to a mapper for processing.

29) For a Hadoop job, how will you write a custom partitioner?

To write a custom partitioner for a Hadoop job, follow these steps (a sketch follows the list):

  • Create a new class that extends the Partitioner class
  • Override the getPartition method in the wrapper that runs the MapReduce job
  • Add the custom partitioner to the job by using the setPartitionerClass method, or add the custom partitioner to the job as a config file
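
A sketch of such a partitioner; the routing rule (keys starting with A–M go to reducer 0, the rest to reducer 1) and the class name are purely illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative rule: keys beginning with A-M go to partition 0, everything else to partition 1.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks < 2) {
            return 0;                       // only one reducer configured, nothing to route
        }
        String k = key.toString();
        char first = k.isEmpty() ? 'A' : Character.toUpperCase(k.charAt(0));
        return (first <= 'M') ? 0 : 1;      // must always return a value in [0, numReduceTasks)
    }
}

// In the driver it would be registered with:
//   job.setPartitionerClass(AlphabetPartitioner.class);
//   job.setNumReduceTasks(2);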

30) For a job in Hadoop, is it possible to change the number of mappers to be created?

No, it is not possible to change the number of mappers to be created. The number of mappers is determined by the number of input splits.

31) Explain what is a sequence file in Hadoop?

Sequence files are used to store binary key/value pairs. Unlike regular compressed files, sequence files support splitting even when the data inside the file is compressed.
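
A hedged sketch of writing binary key/value pairs to a sequence file with the org.apache.hadoop.io.SequenceFile API (the output path and records are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/example.seq");   // illustrative output path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 0; i < 5; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));  // binary key/value pair
            }
        }
    }
}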

32) When Namenode is down what happens to job tracker?

The NameNode is the single point of failure in HDFS, so when the NameNode is down the cluster cannot function and no jobs can run.

33) Explain how indexing in HDFS is done?

Hadoop has a unique way of indexing. Once the data is stored according to the block size, HDFS keeps storing the last part of the data, which points to where the next part of the data is located.

34) Explain is it possible to search for files using wildcards?

Yes, it is possible to search for files using wildcards.

35) List out Hadoop’s three configuration files?

The three configuration files are

  • core-site.xml
  • mapred-site.xml
  • hdfs-site.xml

36) Explain how can you check whether Namenode is working beside using the jps command?

Besides the jps command, you can also check whether the NameNode is working by using

/etc/init.d/hadoop-0.20-namenode status.

37) Explain what is “map” and what is "reducer" in Hadoop?

In Hadoop, the map is the first phase of processing a MapReduce job. A map reads data from an input location and outputs a key-value pair according to the input type.

In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.

38) In Hadoop, which file controls reporting in Hadoop?

In Hadoop, the hadoop-metrics.properties file controls reporting.

39) For using Hadoop list the network requirements?

The network requirements for using Hadoop are:

  • Password-less SSH connection
  • Secure Shell (SSH) for launching server processes

40) Mention what is rack awareness?

Rack awareness is the way in which the NameNode decides how to place blocks based on the rack definitions.

41) Explain what is a Task Tracker in Hadoop?

A Task Tracker in Hadoop is a slave node daemon in the cluster that accepts tasks from a JobTracker. It also periodically sends heartbeat messages to the JobTracker to let it know that the TaskTracker is still alive.

42) Mention what daemons run on a master node and slave nodes?

  • Daemons that run on the master node: NameNode
  • Daemons that run on each slave node: TaskTracker and DataNode

43) Explain how can you debug Hadoop code?

The popular methods for debugging Hadoop code are:

  • By using web interface provided by Hadoop framework
  • By using Counters

44) Explain what is storage and compute nodes?

  • The storage node is the machine or computer where your file system resides to store the processing data
  • The compute node is the computer or machine where your actual business logic will be executed.

45) Mention what is the use of Context Object?

The Context object enables the mapper to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as interfaces which allow it to emit output.

46) Mention what is the next step after Mapper or MapTask?

The next step after the Mapper or MapTask is that the output of the Mapper is sorted, and partitions are created for the output.

47) Mention what is the number of default partitioner in Hadoop?

In Hadoop, the default partitioner is a “Hash” Partitioner.

48) Explain what is the purpose of RecordReader in Hadoop?

In Hadoop, the RecordReader loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper.

49) Explain how is data partitioned before it is sent to the reducer if no custom partitioner is defined in Hadoop?

If no custom partitioner is defined in Hadoop, then a default partitioner computes a hash value for the key and assigns the partition based on the result.
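
The default behaviour mirrors Hadoop's HashPartitioner, which is roughly equivalent to the following sketch:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hash the key, mask off the sign bit, then take the remainder so the
// result always falls in the range [0, numReduceTasks).
public class DefaultStylePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}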

50) Explain what happens when Hadoop spawned 50 tasks for a job and one of the task failed?

Hadoop will restart the task on another TaskTracker; if the task fails more than the allowed number of attempts, the job is marked as failed.

51) Mention what is the best way to copy files between HDFS clusters?

The best way to copy files between HDFS clusters is by using multiple nodes and the distcp command, so the workload is shared.

52) Mention what is the difference between HDFS and NAS?

HDFS data blocks are distributed across local drives of all machines in a cluster while NAS data is stored on dedicated hardware.

53) Mention how Hadoop is different from other data processing tools?

In Hadoop, you can increase or decrease the number of mappers without worrying about the volume of data to be processed.

54) Mention what job does the conf class do?

The JobConf class separates different jobs running on the same cluster. It handles job-level settings such as declaring a job in a real environment.

55) Mention what is the Hadoop MapReduce APIs contract for a key and value class?

For a key and value class, there are two Hadoop MapReduce APIs contract

  • The value class must implement the org.apache.hadoop.io.Writable interface
  • The key class must implement the org.apache.hadoop.io.WritableComparable interface

56) Mention what are the three modes in which Hadoop can be run?

The three modes in which Hadoop can be run are

  • Pseudo distributed mode
  • Standalone (local) mode
  • Fully distributed mode

57) Mention what does the text input format do?

The text input format creates one record per line of the input file: the key is the byte offset of the line and the value is the content of the line. The mapper receives the value as a ‘Text’ parameter and the key as a ‘LongWritable’ parameter.

58) How many InputSplits will the Hadoop framework make for files of 64 KB, 65 MB, and 127 MB (assuming a 64 MB block size)?

Hadoop will make 5 splits:

  • 1 split for the 64 KB file
  • 2 splits for the 65 MB file
  • 2 splits for the 127 MB file

59) Mention what is distributed cache in Hadoop?

Distributed cache in Hadoop is a facility provided by the MapReduce framework. It is used to cache files at the time of job execution: the framework copies the necessary files to the slave node before any task is executed on that node.

60) Explain how does Hadoop Classpath plays a vital role in stopping or starting in Hadoop daemons?

The classpath consists of a list of directories containing the jar files needed to stop or start the daemons.

Top 50 Hadoop Interview Questions for 2018

In this Hadoop interview questions blog, we will be covering all the frequently asked questions that will help you ace the interview with their best solutions. But before that, let me tell you how the demand is continuously increasing for Big Data and Hadoop experts. Big Data will drive $48.6 billion in annual spending by 2019- IDC.

  • McKinsey predicts that by 2018 there will be a shortage of 1.5M data experts
  • Average salary of a Big Data Hadoop developer in the US is $135k- Indeed.com
  • Average annual salary in the United Kingdom is £66,250-£66,750- itjobswatch.co.uk

I would like to draw your attention towards the Big Data revolution. Earlier, organizations were only concerned about operational data, which was less than 20% of the whole data. Later, they realized that analyzing the whole data would give them better business insights and decision-making capability. That was the time when big giants like Yahoo, Facebook, Google, etc. started adopting Hadoop and Big Data related technologies. In fact, nowadays one in every five companies is moving to Big Data analytics. Hence, the demand for Big Data Hadoop jobs is rising rapidly. Therefore, if you want to boost your career, Hadoop and Spark are just the technologies you need. This will always give you a good start, either as a fresher or as an experienced professional.

1. What are the basic differences between relational database and HDFS?

Here are the key differences between HDFS and relational database:

RDBMS vs. Hadoop:

  • Data Types – RDBMS relies on structured data and the schema of the data is always known; Hadoop can store any kind of data, whether structured, unstructured, or semi-structured.
  • Processing – RDBMS provides little or no processing capability; Hadoop allows the data distributed across the cluster to be processed in parallel.
  • Schema on Read vs. Write – RDBMS is based on ‘schema on write’, where schema validation is done before loading the data; Hadoop follows a ‘schema on read’ policy.
  • Read/Write Speed – In RDBMS, reads are fast because the schema of the data is already known; in HDFS, writes are fast because no schema validation happens during an HDFS write.
  • Cost – RDBMS is licensed software, so you have to pay for it; Hadoop is an open-source framework, so there is no software cost.
  • Best-Fit Use Case – RDBMS is used for OLTP (Online Transactional Processing) systems; Hadoop is used for data discovery, data analytics, and OLAP systems.

2. Explain “Big Data” and what are five V’s of Big Data?

“Big data” is the term for a collection of large and complex data sets that are difficult to process using relational database management tools or traditional data processing applications. It is difficult to capture, curate, store, search, share, transfer, analyze, and visualize Big Data. Big Data has emerged as an opportunity for companies: they can now derive value from their data and gain a distinct advantage over their competitors through enhanced business decision-making capabilities.

Tip: It will be a good idea to talk about the 5Vs in such questions, whether it is asked specifically or not!

  • Volume : The volume represents the amount of data which is growing at an exponential rate i.e. in Petabytes and Exabytes.
  • Velocity : Velocity refers to the rate at which data is growing, which is very fast. Today, yesterday’s data are considered as old data. Nowadays, social media is a major contributor to the velocity of growing data.
  • Variety : Variety refers to the heterogeneity of data types. In other words, the data gathered comes in a variety of formats like videos, audio, CSV, etc. These various formats represent the variety of data.
  • Veracity : Veracity refers to data in doubt, or uncertainty about the available data due to data inconsistency and incompleteness. Data can sometimes get messy and be difficult to trust. With many forms of big data, quality and accuracy are difficult to control, and volume is often the reason behind the lack of quality and accuracy in the data.
  • Value : It is all well and good to have access to big data, but unless we can turn it into value it is useless. By turning it into value I mean: is it adding to the benefit of the organization? Is the organization working on Big Data achieving a high ROI (Return on Investment)? Unless it adds to their profits, working on Big Data is useless.

3. What is Hadoop and what are its components?

When “Big Data” emerged as a problem, Apache Hadoop evolved as a solution to it. Apache Hadoop is a framework which provides us various services or tools to store and process Big Data. It helps in analyzing Big Data and making business decisions out of it, which can’t be done efficiently and effectively using traditional systems.

Tip: Now, while explaining Hadoop, you should also explain the main components of Hadoop, i.e.:

  • Storage unit – HDFS (NameNode, DataNode)
  • Processing framework – YARN (ResourceManager, NodeManager)

4. What are HDFS and YARN?

HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment. It follows master and slave topology.

Tip: It is recommended to explain the HDFS components too, i.e.

  • NameNode : NameNode is the master node in the distributed environment and it maintains the metadata information for the blocks of data stored in HDFS like block location, replication factors etc.
  • DataNode : DataNodes are the slave nodes, which are responsible for storing data in the HDFS. NameNode manages all the DataNodes.

YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop, which manages resources and provides an execution environment to the processes.

Tip: Similarly, as we did for HDFS, we should also explain the two components of YARN:

  • ResourceManager : It receives the processing requests, and then passes the parts of requests to corresponding NodeManagers accordingly, where the actual processing takes place. It allocates resources to applications based on the needs.
  • NodeManager : NodeManager is installed on every DataNode and it is responsible for the execution of the task on every single DataNode.

5. Tell me about the various Hadoop daemons and their roles in a Hadoop cluster.

Generally, approach this question by first explaining the HDFS daemons i.e. NameNode, DataNode and Secondary NameNode, then moving on to the YARN daemons i.e. ResourceManager and NodeManager, and lastly explaining the JobHistoryServer.

  • NameNode: It is the master node which is responsible for storing the metadata of all the files and directories. It has information about blocks, that make a file, and where those blocks are located in the cluster.
  • Datanode: It is the slave node that contains the actual data.
  • Secondary NameNode: It periodically merges the changes (edit log) with the FsImage (Filesystem Image), present in the NameNode. It stores the modified FsImage into persistent storage, which can be used in case of failure of NameNode.
  • ResourceManager: It is the central authority that manages resources and schedules applications running on top of YARN.
  • NodeManager: It runs on slave machines, and is responsible for launching the application’s containers (where applications execute their part), monitoring their resource usage (CPU, memory, disk, network) and reporting these to the ResourceManager.
  • JobHistoryServer: It maintains information about MapReduce jobs after the Application Master terminates.

Hadoop HDFS Interview Questions

6. Compare HDFS with Network Attached Storage (NAS).

In this question, first explain NAS and HDFS, and then compare their features as follows:

  • Network-attached storage (NAS) is a file-level computer data storage server connected to a computer network providing data access to a heterogeneous group of clients. NAS can either be a hardware or software which provides services for storing and accessing files. Whereas Hadoop Distributed File System (HDFS) is a distributed filesystem to store data using commodity hardware.
  • In HDFS Data Blocks are distributed across all the machines in a cluster. Whereas in NAS data is stored on a dedicated hardware.
  • HDFS is designed to work with MapReduce paradigm, where computation is moved to the data. NAS is not suitable for MapReduce since data is stored separately from the computations.
  • HDFS uses commodity hardware, which is cost-effective, whereas NAS is a high-end storage device which comes at a high cost.

7. List the difference between Hadoop 1 and Hadoop 2.

This is an important question and while answering this question, we have to mainly focus on two points i.e. Passive NameNode and YARN architecture.

  • In Hadoop 1.x, “NameNode” is the single point of failure. In Hadoop 2.x, we have Active and Passive “NameNodes”. If the active “NameNode” fails, the passive “NameNode” takes charge. Because of this, high availability can be achieved in Hadoop 2.x.
  • Also, in Hadoop 2.x, YARN provides a central resource manager. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource. MRV2 is a particular type of distributed application that runs the MapReduce framework on top of YARN. Other tools can also perform data processing via YARN, which was a problem in Hadoop 1.x.

8. What are active and passive “NameNodes”?

In HA (High Availability) architecture, we have two NameNodes – Active “NameNode” and Passive “NameNode”.

  • Active “NameNode” is the “NameNode” which works and runs in the cluster.
  • Passive “NameNode” is a standby “NameNode”, which has similar data as active “NameNode”.

When the active “NameNode” fails, the passive “NameNode” replaces the active “NameNode” in the cluster. Hence, the cluster is never without a “NameNode” and so it never fails.

9. Why does one remove or add nodes in a Hadoop cluster frequently?

One of the most attractive features of the Hadoop framework is its utilization of commodity hardware. However, this leads to frequent DataNode crashes in a Hadoop cluster. Another striking feature of the Hadoop framework is the ease of scaling in accordance with the rapid growth in data volume. Because of these two reasons, one of the most common tasks of a Hadoop administrator is to commission (add) and decommission (remove) DataNodes in a Hadoop cluster.

10. What happens when two clients try to access the same file in the HDFS?

HDFS supports exclusive writes only.

When the first client contacts the “NameNode” to open the file for writing, the “NameNode” grants a lease to the client to create this file. When the second client tries to open the same file for writing, the “NameNode” will notice that the lease for the file is already granted to another client, and will reject the open request for the second client.

11. How does NameNode tackle DataNode failures?

NameNode periodically receives a Heartbeat (signal) from each of the DataNode in the cluster, which implies DataNode is functioning properly.

A block report contains a list of all the blocks on a DataNode. If a DataNode fails to send a heartbeat message, after a specific period of time it is marked dead.

The NameNode replicates the blocks of dead node to another DataNode using the replicas created earlier.

12. What will you do when NameNode is down?

The NameNode recovery process involves the following steps to make the Hadoop cluster up and running:

1. Use the file system metadata replica (FsImage) to start a new NameNode.

2. Then, configure the DataNodes and clients so that they can acknowledge the new NameNode that has been started.

3. Now the new NameNode will start serving the client after it has completed loading the last checkpoint FsImage (for metadata information) and received enough block reports from the DataNodes.

13. What is a checkpoint?

In brief, “Checkpointing” is a process that takes an FsImage, edit log and compacts them into a new FsImage. Thus, instead of replaying an edit log, the NameNode can load the final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by Secondary NameNode.

14. How is HDFS fault tolerant?

When data is stored over HDFS, the NameNode replicates the data to several DataNodes. The default replication factor is 3. You can change the configuration factor as per your need. If a DataNode goes down, the NameNode will automatically copy the data to another node from the replicas and make the data available. This provides fault tolerance in HDFS.

15. Can NameNode and DataNode be a commodity hardware?

The smart answer to this question would be that DataNodes are commodity hardware like personal computers and laptops, as they store data and are required in large numbers. But from your experience, you can tell that the NameNode is the master node which stores metadata about all the blocks stored in HDFS. It requires high memory (RAM) space, so the NameNode needs to be a high-end machine with good memory space.

16. Why do we use HDFS for applications having large data sets and not when there are a lot of small files?

HDFS is more suitable for large amounts of data in a single file as compared to small amounts of data spread across multiple files. As you know, the NameNode stores the metadata about the file system in RAM. Therefore, the amount of memory places a limit on the number of files in the HDFS file system. In other words, too many files lead to too much metadata, and storing this metadata in RAM becomes a challenge. As a rule of thumb, metadata for a file, block or directory takes about 150 bytes.

17. How do you define “block” in HDFS? What is the default block size in Hadoop 1 and in Hadoop 2? Can it be changed?

Blocks are nothing but the smallest continuous locations on your hard drive where data is stored. HDFS stores each file as blocks and distributes them across the Hadoop cluster. Files in HDFS are broken down into block-sized chunks, which are stored as independent units.

  • Hadoop 1 default block size: 64 MB
  • Hadoop 2 default block size: 128 MB

Yes, blocks can be configured. The dfs.block.size parameter can be set in the hdfs-site.xml file to change the block size in a Hadoop environment.

18. What does ‘jps’ command do?

The ‘jps’ command helps us to check if the Hadoop daemons are running or not. It shows all the Hadoop daemons i.e namenode, datanode, resourcemanager, nodemanager etc. that are running on the machine.

19. How do you define “Rack Awareness” in Hadoop?

Rack Awareness is the algorithm in which the “NameNode” decides how blocks and their replicas are placed, based on rack definitions to minimize network traffic between “DataNodes” within the same rack. Let’s say we consider replication factor 3 (default), the policy is that “for every block of data, two copies will exist in one rack, third copy in a different rack”. This rule is known as the “Replica Placement Policy”.

20. What is “speculative execution” in Hadoop?

If a node appears to be executing a task slower, the master node can redundantly execute another instance of the same task on another node. Then, the task which finishes first will be accepted and the other one is killed. This process is called “speculative execution”.

21. How can I restart “NameNode” or all the daemons in Hadoop?

This question can have two answers, we will discuss both the answers. We can restart NameNode by following methods:

1. You can stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop namenode command and then start it again using the ./sbin/hadoop-daemon.sh start namenode command.

2. To stop and start all the daemons, use ./sbin/stop-all.sh and then ./sbin/start-all.sh, which will stop all the daemons first and then start all of them.

These script files reside in the sbin directory inside the Hadoop directory.

22. What is the difference between an “HDFS Block” and an “Input Split”?

The “HDFS Block” is the physical division of the data while “Input Split” is the logical division of the data. HDFS divides data in blocks for storing the blocks together, whereas for processing, MapReduce divides the data into the input split and assign it to mapper function.

23. Name the three modes in which Hadoop can run.

The three modes in which Hadoop can run are as follows:

1. Standalone (local) mode: This is the default mode if we don’t configure anything. In this mode, all the components of Hadoop, such as NameNode, DataNode, ResourceManager, and NodeManager, run as a single Java process. This mode uses the local filesystem.

2. Pseudo-distributed mode: A single-node Hadoop deployment is considered as running Hadoop in pseudo-distributed mode. In this mode, all the Hadoop services, including both the master and the slave services, run on a single compute node.

3. Fully distributed mode: A Hadoop deployment in which the Hadoop master and slave services run on separate nodes is referred to as fully distributed mode.

24. What is “MapReduce”? What is the syntax to run a “MapReduce” program?

It is a framework/programming model that is used for processing large data sets over a cluster of computers using parallel programming. The syntax to run a MapReduce program is hadoop jar <jar_file_name> <ClassName> /input_path /output_path.


25. What are the main configuration parameters in a “MapReduce” program?

The main configuration parameters which users need to specify in “MapReduce” framework are:

  • Job’s input locations in the distributed file system
  • Job’s output location in the distributed file system
  • Input format of data
  • Output format of data
  • Class containing the map function
  • Class containing the reduce function
  • JAR file containing the mapper, reducer and driver classes

26. State the reason why we can’t perform “aggregation” (addition) in mapper? Why do we need the “reducer” for this?

This answer includes many points, so we will go through them sequentially.

  • We cannot perform “aggregation” (addition) in mapper because sorting does not occur in the “mapper” function. Sorting occurs only on the reducer side and without sorting aggregation cannot be done.
  • During “aggregation”, we need the output of all the mapper functions, which may not be possible to collect in the map phase, as the mappers may be running on different machines where the data blocks are stored.
  • And lastly, if we try to aggregate data at mapper, it requires communication between all mapper functions which may be running on different machines. So, it will consume high network bandwidth and can cause network bottlenecking.

27. What is the purpose of “RecordReader” in Hadoop?

The “InputSplit” defines a slice of work, but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task. The “RecordReader” instance is defined by the “Input Format”.

28. Explain “Distributed Cache” in a “MapReduce Framework”.

Distributed Cache can be explained as a facility provided by the MapReduce framework to cache files needed by applications. Once you have cached a file for your job, the Hadoop framework will make it available on every data node where your map/reduce tasks are running. You can then access the cached file as a local file in your Mapper or Reducer job.

29. How do “reducers” communicate with each other?

This is a tricky question. The “MapReduce” programming model does not allow “reducers” to communicate with each other. “Reducers” run in isolation.

30. What does a “MapReduce Partitioner” do?

A “MapReduce Partitioner” makes sure that all the values of a single key go to the same “reducer”, thus allowing even distribution of the map output over the “reducers”. It redirects the “mapper” output to the “reducer” by determining which “reducer” is responsible for the particular key.

31. How will you write a custom partitioner?

Custom partitioner for a Hadoop job can be written easily by following the below steps:

  • Create a new class that extends the Partitioner class
  • Override the getPartition method in the wrapper that runs the MapReduce job.
  • Add the custom partitioner to the job by using the setPartitionerClass method, or add the custom partitioner to the job as a config file.

32. What is a “Combiner”?

A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”. “Combiners” help in enhancing the efficiency of “MapReduce” by reducing the quantum of data that is required to be sent to the “reducers”.

33. What do you know about “SequenceFileInputFormat”?

“SequenceFileInputFormat” is an input format for reading sequence files. It is a specific compressed binary file format which is optimized for passing the data from the output of one “MapReduce” job to the input of some other “MapReduce” job.

Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.

34. What are the benefits of Apache Pig over MapReduce?

Apache Pig is a platform developed by Yahoo, used to analyze large data sets by representing them as data flows. It is designed to provide an abstraction over MapReduce, reducing the complexity of writing MapReduce programs.

  • Pig Latin is a high-level data flow language, whereas MapReduce is a low-level data processing paradigm.
  • Without writing complex Java implementations in MapReduce, programmers can achieve the same implementations very easily using Pig Latin.
  • Apache Pig reduces the length of the code by approx 20 times (according to Yahoo). Hence, this reduces the development period by almost 16 times.
  • Pig provides many built-in operators to support data operations like joins, filters, ordering, sorting etc. Whereas to perform the same function in MapReduce is a humongous task.
  • Performing a Join operation in Apache Pig is simple. Whereas it is difficult in MapReduce to perform a Join operation between the data sets, as it requires multiple MapReduce tasks to be executed sequentially to fulfill the job.
  • In addition, pig also provides nested data types like tuples, bags, and maps that are missing from MapReduce.

35. What are the different data types in Pig Latin?

Pig Latin can handle both atomic data types like int, float, long, double etc. and complex data types like tuple, bag and map.

Atomic data types: Atomic or scalar data types are the basic data types which are used in all the languages like string, int, float, long, double, char[], byte[].

Complex Data Types: Complex data types are Tuple, Map and Bag.


36. What are the different relational operations in “Pig Latin” you worked with?

Different relational operators are:

1. foreach

2. order by

3. filter

4. group

5. distinct

6. join

7. limit

37. What is a UDF?

If some functionality is unavailable in the built-in operators, we can programmatically create User Defined Functions (UDFs) to provide it, using languages like Java, Python, Ruby, etc., and embed them in the script file.
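
For example, a simple Java UDF for Pig might look like the following sketch (the class name and behaviour are illustrative; it assumes the Pig client library is on the classpath). It would then be registered in a Pig script with REGISTER and called like a built-in function.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial Pig UDF that upper-cases its first argument; returns null for empty input.
public class UpperCaseUDF extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase();
    }
}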

38. What is “SerDe” in “Hive”?

Apache Hive is a data warehouse system developed by Facebook, built on top of Hadoop and used for analyzing structured and semi-structured data. Hive abstracts the complexity of Hadoop MapReduce.

The “SerDe” interface allows you to instruct “Hive” about how a record should be processed. A “SerDe” is a combination of a “Serializer” and a “Deserializer”. “Hive” uses “SerDe” (and “FileFormat”) to read and write the table’s row.

39. Can the default “Hive Metastore” be used by multiple users (processes) at the same time?

“Derby database” is the default “Hive Metastore”. Multiple users (processes) cannot access it at the same time. It is mainly used to perform unit tests.

40. What is the default location where “Hive” stores table data?

The default location where Hive stores table data is inside HDFS in /user/hive/warehouse.

Apache HBase Interview Questions

41. What is Apache HBase?

HBase is an open source, multidimensional, distributed, scalable and a NoSQL database written in Java. HBase runs on top of HDFS (Hadoop Distributed File System) and provides BigTable (Google) like capabilities to Hadoop. It is designed to provide a fault-tolerant way of storing the large collection of sparse data sets. HBase achieves high throughput and low latency by providing faster Read/Write Access on huge datasets.
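
A hedged sketch of a basic write and read against HBase using the standard Java client (the table, column family, qualifier and values are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {   // hypothetical table
            // Write one cell: row "row-1", column family "info", qualifier "city".
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Chennai"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println(Bytes.toString(city));
        }
    }
}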

42. What are the components of Apache HBase?

HBase has three major components, i.e. HMaster Server, HBase RegionServer and Zookeeper.

  • Region Server : A table can be divided into several regions. A group of regions is served to the clients by a Region Server.
  • HMaster : It coordinates and manages the Region Server (similar as NameNode manages DataNode in HDFS).
  • ZooKeeper : ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps in maintaining server state inside the cluster by communicating through sessions.

43. What are the components of Region Server?

The components of a Region Server are:

  • WAL : Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage.
  • Block Cache : Block Cache resides in the top of Region Server. It stores the frequently read data in the memory.
  • MemStore : It is the write cache. It stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family in a region.
  • HFile : HFile is stored in HDFS. It stores the actual cells on the disk.

44. Explain “WAL” in HBase?

Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage. It is used in case of failure to recover the data sets.

45. Mention the differences between “HBase” and “Relational Databases”?

HBase is an open source, multidimensional, distributed, scalable and a NoSQL database written in Java. HBase runs on top of HDFS and provides BigTable like capabilities to Hadoop. Let us see the differences between HBase and relational database.

HBase vs. Relational Database:

  • HBase is schema-less; a relational database is schema-based.
  • HBase is a column-oriented data store; a relational database is a row-oriented data store.
  • HBase is used to store de-normalized data; a relational database is used to store normalized data.
  • HBase contains sparsely populated tables; a relational database contains thin tables.
  • Automated partitioning is done in HBase; a relational database has no such provision or built-in support for partitioning.

Apache Spark Interview Questions

46. What is Apache Spark?

The answer to this question is, Apache Spark is a framework for real-time data analytics in a distributed computing environment. It executes in-memory computations to increase the speed of data processing.

It is 100x faster than MapReduce for large-scale data processing by exploiting in-memory computations and other optimizations.

47. Can you build “Spark” with any particular Hadoop version?

Yes, one can build “Spark” for a specific Hadoop version.

48. Define RDD.

RDD is the acronym for Resilient Distribution Datasets – a fault-tolerant collection of operational elements that run parallel. The partitioned data in RDD are immutable and distributed, which is a key component of Apache Spark.
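
A hedged Java sketch of creating and transforming an RDD (the application name, local master and sample data are illustrative):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // An RDD is an immutable, partitioned collection; transformations return new RDDs.
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            JavaRDD<Integer> doubled = numbers.map(n -> n * 2);   // lazy transformation
            List<Integer> result = doubled.collect();             // action: triggers execution
            System.out.println(result);                           // prints [2, 4, 6, 8, 10]
        }
    }
}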

Oozie & ZooKeeper Interview Questions

49. What is Apache ZooKeeper and Apache Oozie?

Apache ZooKeeper coordinates with various services in a distributed environment. It saves a lot of time by performing synchronization, configuration maintenance, grouping and naming.

Apache Oozie is a scheduler which schedules Hadoop jobs and binds them together as one logical work. There are two kinds of Oozie jobs:

  • Oozie Workflow : These are the sequential sets of actions to be executed. You can think of it as a relay race, where each athlete waits for the previous one to complete their part.
  • Oozie Coordinator : These are the Oozie jobs which are triggered when the data is made available to it. Think of this as the response-stimuli system in our body. In the same manner, as we respond to an external stimulus, an Oozie coordinator responds to the availability of data and it rests otherwise.

50. How do you configure an “Oozie” job in Hadoop?

“Oozie” is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs such as “Java MapReduce”, “Streaming MapReduce”, “Pig”, “Hive” and “Sqoop”.

7) What are the main components of a Hadoop application? Hadoop applications use a wide range of technologies that provide great advantages in solving complex business problems.

Core components of a Hadoop application are-

1) Hadoop Common

2) HDFS

3) Hadoop MapReduce

4) YARN

Data Access Components are - Pig and Hive

Data Storage Component is - HBase

Data Integration Components are - Apache Flume, Sqoop, Chukwa

Data Management and Monitoring Components are - Ambari, Oozie and Zookeeper.

Data Serialization Components are - Thrift and Avro

Data Intelligence Components are - Apache Mahout and Drill.

8. What is Hadoop streaming? The Hadoop distribution has a generic application programming interface for writing map and reduce jobs in any desired programming language like Python, Perl, Ruby, etc. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the mapper or reducer.

9. What is the best hardware configuration to run Hadoop? The best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4GB or 8GB RAM that use ECC memory. Hadoop benefits greatly from ECC memory, even though it is not low-end; ECC memory is recommended for running Hadoop because most Hadoop users have experienced various checksum errors when using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.

10. What are the most commonly defined input formats in Hadoop? The most common Input Formats defined in Hadoop are:

  • Text Input Format- This is the default input format defined in Hadoop.
  • Key Value Input Format- This input format is used for plain text files wherein the files are broken down into lines.
  • Sequence File Input Format- This input format is used for reading files in sequence.

11. What are the steps involved in deploying a big data solution?

i) Data Ingestion – The foremost step in deploying big data solutions is to extract data from the different sources, which could be an Enterprise Resource Planning system like SAP, a CRM like Salesforce or Siebel, an RDBMS like MySQL or Oracle, or log files, flat files, documents, images, and social media feeds. This data needs to be stored in HDFS. Data can be ingested either through batch jobs that run, say, every 15 minutes or once every night, or through real-time streaming with latencies from 100 ms to 120 seconds.

ii) Data Storage – The subsequent step after ingesting data is to store it either in HDFS or NoSQL database like HBase. HBase storage works well for random read/write access whereas HDFS is optimized for sequential access.

iii) Data Processing – The final step is to process the data using one of the processing frameworks such as MapReduce, Spark, Pig, Hive, etc.

12. How will you choose various file formats for storing and processing data using Apache Hadoop ?

The decision to choose a particular file format is based on the following factors-

i) Schema evolution to add, alter and rename fields.

ii) Usage pattern like accessing 5 columns out of 50 columns vs accessing most of the columns.

iii) Splittability so the data can be processed in parallel.

iv) Read/Write/Transfer performance vs block compression saving storage space

File Formats that can be used with Hadoop - CSV, JSON, Columnar, Sequence files, AVRO, and Parquet file.

CSV Files

CSV files are an ideal fit for exchanging data between hadoop and external systems. It is advisable not to use header and footer lines when using CSV files.

JSON Files

Every JSON file has its own record. JSON stores both data and schema together in a record and also enables complete schema evolution and splittability. However, JSON files do not support block-level compression.

Avro Files

This kind of file format is best suited for long term storage with Schema. Avro files store metadata with data and also let you specify independent schema for reading the files.

Parquet Files

A columnar file format that supports block-level compression and is optimized for query performance, as it allows the selection of 10 or fewer columns from records with 50+ columns.

11. What is Big Data? Big data is defined as the voluminous amount of structured, unstructured or semi-structured data that has huge potential for mining but is so large that it cannot be processed using traditional database systems. Big data is characterized by its high velocity, volume and variety that requires cost effective and innovative methods for information processing to draw meaningful business insights. More than the volume of the data – it is the nature of the data that defines whether it is considered as Big Data or not.

1. What is a block and block scanner in HDFS? Block - The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default size of a block in HDFS is 64MB in Hadoop 1 (128MB in Hadoop 2).

Block Scanner - Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the datanode.

2. Explain the difference between NameNode, Backup Node and Checkpoint NameNode. NameNode : NameNode is at the heart of the HDFS file system which manages the metadata i.e. the data of the files is not stored on the NameNode but rather it has the directory tree of all the files present in the HDFS file system on a hadoop cluster. NameNode uses two files for the namespace-

fsimage file- It keeps track of the latest checkpoint of the namespace.

edits file-It is a log of changes that have been made to the namespace since checkpoint.

Checkpoint Node-

Checkpoint Node keeps track of the latest checkpoint in a directory that has same structure as that of NameNode’s directory. Checkpoint node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage file from the NameNode and merging it locally. The new image is then again updated back to the active NameNode.

BackupNode:

Backup Node also provides checkpointing functionality like the Checkpoint Node, but it additionally maintains an up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.

3. What is commodity hardware? Commodity hardware refers to inexpensive systems that do not offer high availability or premium quality. Commodity hardware still needs sufficient RAM because several Hadoop services execute in memory. Hadoop can run on any commodity hardware and does not require supercomputers or high-end hardware configurations to execute jobs.

4. What is the port number for NameNode, Task Tracker and Job Tracker?

NameNode - 50070

Job Tracker - 50030

Task Tracker - 50060

5. Explain about the process of inter cluster data copying.

HDFS provides a distributed data copying facility through DistCP, which copies data from a source to a destination. When the data is copied between two different hadoop clusters, it is referred to as inter cluster data copying. DistCP requires both the source and destination to have the same or a compatible version of hadoop.

6. How can you overwrite the replication factors in HDFS?

The replication factor in HDFS can be modified or overwritten in 2 ways-

1) Using the Hadoop FS shell, the replication factor can be changed on a per-file basis using the below command-

$ hadoop fs -setrep -w 2 /my/test_file (test_file is the filename whose replication factor will be set to 2)

2) Using the Hadoop FS shell, the replication factor of all files under a given directory can be modified using the below command-

$ hadoop fs -setrep -w 5 /my/test_dir (test_dir is the name of the directory and all the files in this directory will have a replication factor set to 5)

7. Explain the difference between NAS and HDFS.

  • NAS runs on a single machine and thus provides no data redundancy, whereas HDFS runs on a cluster of different machines and provides data redundancy through its replication protocol.
  • NAS stores data on dedicated hardware whereas in HDFS all the data blocks are distributed across the local drives of the machines.
  • In NAS, data is stored independently of the computation and hence Hadoop MapReduce cannot be used for processing, whereas HDFS works with Hadoop MapReduce as the computation is moved to the data.

8. Explain what happens if during the PUT operation, HDFS block is assigned a replication factor 1 instead of the default value 3.

Replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times the blocks are replicated to ensure high data availability. For every block that is stored in HDFS, the cluster will have n-1 duplicated blocks. So, if the replication factor during the PUT operation is set to 1 instead of the default value 3, then only a single copy of the data will be stored. Under these circumstances, if the DataNode holding that block crashes, that single copy of the data is lost.

9. What is the process to change the files at arbitrary locations in HDFS?

HDFS does not support modifications at arbitrary offsets in a file or multiple writers. Files are written by a single writer in append-only fashion, i.e. writes to a file in HDFS are always made at the end of the file.

10. Explain about the indexing process in HDFS.

Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that further points to the address where the next part of data chunk is stored.

11. What is a rack awareness and on what basis is data stored in a rack?

All the data nodes put together form a storage area, i.e. the physical location of the data nodes is referred to as a Rack in HDFS. The rack information, i.e. the rack id of each data node, is acquired by the NameNode. The process of selecting closer data nodes based on the rack information is known as Rack Awareness.

The contents of the file are divided into data blocks as soon as the client is ready to load the file into the hadoop cluster. After consulting with the NameNode, the client allocates 3 data nodes for each data block. For each data block, two copies exist in one rack and the third copy is present in another rack. This is generally referred to as the Replica Placement Policy.

12. What happens to a NameNode that has no data?

There does not exist any NameNode without data. If it is a NameNode then it should have some sort of data in it.

13. What happens when a user submits a Hadoop job when the NameNode is down- does the job get put on hold or does it fail?

The Hadoop job fails when the NameNode is down.

14. What happens when a user submits a Hadoop job when the Job Tracker is down- does the job get put on hold or does it fail?

The Hadoop job fails when the Job Tracker is down.

15. Whenever a client submits a hadoop job, who receives it?

NameNode receives the Hadoop job which then looks for the data requested by the client and provides the block information. JobTracker takes care of resource allocation of the hadoop job to ensure timely completion.

16. What do you understand by edge nodes in Hadoop?

Edge nodes are the interface between the hadoop cluster and the external network. Edge nodes are used for running cluster administration tools and client applications. Edge nodes are also referred to as gateway nodes.

We have further categorized Hadoop HDFS Interview Questions for Freshers and Experienced-

  • Hadoop Interview Questions and Answers for Freshers - Q.Nos- 2,3,7,9,10,11,13,14
  • Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1,2, 4,5,6,7,8,12,15

Hadoop MapReduce Interview Questions and Answers

1. Explain the usage of Context Object.

Context Object is used to help the mapper interact with the rest of the Hadoop system. It can be used for updating counters, reporting progress and providing any application-level status updates. The Context object also holds the configuration details for the job and provides the interfaces that help in generating the output.
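
A minimal sketch, assuming the org.apache.hadoop.mapreduce API, of a Mapper that uses the Context object to emit output and update a counter (the class name and counter names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) {
                // The Context object can also update application-level counters.
                context.getCounter("TokenCount", "EMPTY_TOKENS").increment(1);
                continue;
            }
            word.set(token);
            context.write(word, ONE); // output is emitted through the Context
        }
    }
}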

2. What are the core methods of a Reducer?

The 3 core methods of a reducer are –

1)setup () – This method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc.

Function Definition- public void setup (context)

2)reduce () - This is the heart of the reducer and is called once per key with the associated list of values for that key.

Function Definition- public void reduce (Key, Values, Context)

3)cleanup () - This method is called only once at the end of reduce task for clearing all the temporary files.

Function Definition -public void cleanup (context)
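
A minimal sketch of a reducer that implements all three methods (the class name and key/value types are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumPerKeyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) {
        // Called once before any reduce() call; read job configuration or
        // distributed cache files here.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // one output record per key
    }

    @Override
    protected void cleanup(Context context) {
        // Called once after the last reduce() call; release temporary resources here.
    }
}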

3. Explain about the partitioning, shuffle and sort phase

Shuffle Phase- Once the first map tasks are completed, the nodes continue to perform several other map tasks and also exchange the intermediate outputs with the reducers as required. This process of moving the intermediate outputs of map tasks to the reducer is referred to as Shuffling.

Sort Phase - Hadoop MapReduce automatically sorts the set of intermediate keys on a single node before they are given as input to the reducer.

Partitioning Phase- The process that determines which intermediate keys and value will be received by each reducer instance is referred to as partitioning. The destination partition is same for any key irrespective of the mapper instance that generated it.

4. How to write a custom partitioner for a Hadoop MapReduce job?

Steps to write a Custom Partitioner for a Hadoop MapReduce Job-

  • A new class must be created that extends the pre-defined Partitioner Class.
  • getPartition method of the Partitioner class must be overridden.
  • The custom partitioner can be added to the job as a config option in the wrapper that runs the Hadoop MapReduce job, or it can be set programmatically using the job's setPartitionerClass method (a minimal sketch is shown below).
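
A minimal sketch of a custom partitioner that routes keys to reducers based on their first character (the class name and the partitioning rule are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Keys starting with the same character always go to the same reducer.
        char first = key.toString().isEmpty() ? ' ' : key.toString().charAt(0);
        return (first & Integer.MAX_VALUE) % numPartitions;
    }
}

The partitioner is then registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).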

5. What are side data distribution techniques in Hadoop?

The extra read only data required by a hadoop job to process the main dataset is referred to as side data. Hadoop has two side data distribution techniques -

i) Using the job configuration - This technique should not be used for transferring more than a few kilobytes of data as it can put pressure on the memory usage of the hadoop daemons, particularly if your system is running several hadoop jobs.

ii) Distributed Cache - Rather than serializing side data using the job configuration, it is suggested to distribute data using hadoop's distributed cache mechanism.
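
A minimal sketch of registering a side-data file with the distributed cache through the Hadoop 2.x Job API (the file path and job name are illustrative; older releases use the DistributedCache class instead):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SideDataJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "join-with-side-data");
        // Register a read-only lookup file so every task gets a local copy.
        job.addCacheFile(new URI("hdfs:///lookup/country_codes.txt"));
        // Inside a task, the cached copies can be located via
        // context.getCacheFiles() in the setup() method.
    }
}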

We have further categorized Hadoop MapReduce Interview Questions for Freshers and Experienced-

  • Hadoop Interview Questions and Answers for Freshers - Q.Nos- 2
  • Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1,3,4,5

1. When should you use HBase and what are the key components of HBase?

HBase should be used when the big data application has –

1)A variable schema

2)When data is stored in the form of collections

3)If the application demands key based access to data while retrieving.

Key components of HBase are –

Region- This component contains the in-memory data store (MemStore) and the HFile.

Region Server-This monitors the Region.

HBase Master-It is responsible for monitoring the region server.

Zookeeper- It takes care of the coordination between the HBase Master component and the client.

Catalog Tables- The two important catalog tables are ROOT and META. The ROOT table tracks where the META table is and the META table stores all the regions in the system.

2. What are the different operational commands in HBase at record level and table level?

Record Level Operational Commands in HBase are –put, get, increment, scan and delete.

Table Level Operational Commands in HBase are-describe, list, drop, disable and scan.

3. What is Row Key?

Every row in an HBase table has a unique identifier known as RowKey. It is used for grouping cells logically and it ensures that all cells that have the same RowKeys are co-located on the same server. RowKey is internally regarded as a byte array.
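
A minimal sketch using the HBase Java client, where the row key is passed as a byte array when constructing the Put (the table name, row key and column values are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // The row key ("user#1001") groups all cells of this row together
            // and determines which region server stores them.
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
        }
    }
}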

4. Explain the difference between RDBMS data model and HBase data model.

RDBMS is a schema-based database whereas HBase has a schema-less data model.

RDBMS does not have support for in-built partitioning whereas in HBase there is automated partitioning.

RDBMS stores normalized data whereas HBase stores de-normalized data.

5. Explain about the different catalog tables in HBase?

The two important catalog tables in HBase are ROOT and META. The ROOT table tracks where the META table is and the META table stores all the regions in the system.

6. What is column families? What happens if you alter the block size of ColumnFamily on an already populated database?

The logical division of data is represented through a key known as a column family. Column families consist of the basic unit of physical storage on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data will remain within the old block size whereas the new data that comes in will take the new block size. When compaction takes place, the old data will take the new block size so that the existing data is read correctly.

7. Explain the difference between HBase and Hive.

HBase and Hive are two completely different hadoop based technologies- Hive is a data warehouse infrastructure on top of Hadoop whereas HBase is a NoSQL key value store that runs on top of Hadoop. Hive helps SQL savvy people to run MapReduce jobs whereas HBase supports 4 primary operations- put, get, scan and delete. HBase is ideal for real-time querying of big data whereas Hive is an ideal choice for analytical querying of data collected over a period of time.

8. Explain the process of row deletion in HBase.

On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells but rather the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.
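
A minimal sketch of issuing such a delete through the HBase Java client (the table name and row key are illustrative); HBase only writes a tombstone marker and physically removes the cells later during compaction:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteRowExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Marks the whole row with a tombstone; compaction removes it later.
            Delete delete = new Delete(Bytes.toBytes("user#1001"));
            table.delete(delete);
        }
    }
}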

9. What are the different types of tombstone markers in HBase for deletion?

There are 3 different types of tombstone markers in HBase for deletion-

1)Family Delete Marker- This marker marks all the columns for a column family.

2)Version Delete Marker- This marker marks a single version of a column.

3)Column Delete Marker- This marker marks all the versions of a column.

10. Explain about HLog and WAL in HBase.

All edits in the HStore are stored in the HLog. Every region server has one HLog, which contains entries for the edits of all regions served by that region server. WAL stands for Write Ahead Log; all HLog edits are written to it immediately. In case of deferred log flush, WAL edits remain in memory until the flush period.

We have further categorized Hadoop HBase Interview Questions for Freshers and Experienced-

  • Hadoop Interview Questions and Answers for Freshers - Q.Nos-1,2,4,5,7
  • Hadoop Interview Questions and Answers for Experienced - Q.Nos-2,3,6,8,9,10

1. Explain about some important Sqoop commands other than import and export.

Create Job (--create)

Here we are creating a job with the name myjob, which can import table data from an RDBMS table to HDFS. The following command is used to create a job that imports data from the employee table in the db database to HDFS.

$ Sqoop job --create myjob \

--import \

--connect jdbc:mysql://localhost/db \

--username root \

--table employee --m 1

Verify Job (--list)

‘--list’ argument is used to verify the saved jobs. The following command is used to verify the list of saved Sqoop jobs.

$ Sqoop job --list

Inspect Job (--show)

‘--show’ argument is used to inspect or verify particular jobs and their details. The following command and sample output is used to verify a job called myjob.

$ Sqoop job --show myjob

Execute Job (--exec)

‘--exec’ option is used to execute a saved job. The following command is used to execute a saved job called myjob.

$ Sqoop job --exec myjob

2. How Sqoop can be used in a Java program?

The Sqoop jar should be included in the Java program's classpath. After this, the Sqoop.runTool() method must be invoked, and the necessary parameters should be passed to Sqoop programmatically, just like on the command line.
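
A minimal sketch, assuming the Sqoop class is available as org.apache.sqoop.Sqoop on the classpath (older distributions package it as com.cloudera.sqoop.Sqoop); the connection details are illustrative:

import org.apache.sqoop.Sqoop;

public class SqoopFromJava {
    public static void main(String[] args) {
        // The same arguments that would be typed on the command line are
        // passed programmatically to Sqoop.runTool().
        String[] sqoopArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://localhost/db",
            "--username", "root",
            "--table", "employee",
            "--target-dir", "/user/hadoop/employee",
            "-m", "1"
        };
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.out.println("Sqoop exited with code " + exitCode);
    }
}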

3. What is the process to perform an incremental data load in Sqoop?

The process to perform incremental data load in Sqoop is to synchronize the modified or updated data (often referred as delta data) from RDBMS to Hadoop. The delta data can be facilitated through the incremental load command in Sqoop.

Incremental load can be performed by using Sqoop import command or by loading the data into hive without overwriting it. The different attributes that need to be specified during incremental load in Sqoop are-

1)Mode (incremental) –The mode defines how Sqoop will determine what the new rows are. The mode can have value as Append or Last Modified.

2)Col (Check-column) –This attribute specifies the column that should be examined to find out the rows to be imported.

3)Value (last-value) –This denotes the maximum value of the check column from the previous import operation.

4. Is it possible to do an incremental import using Sqoop?

Yes, Sqoop supports two types of incremental imports-

1)Append

2)Last Modified

To import only newly inserted rows, Append mode should be used in the import command; to import both newly inserted and updated rows, Last-Modified mode should be used in the import command.

5. What is the standard location or path for Hadoop Sqoop scripts?

/usr/bin/Hadoop Sqoop

6. How can you check all the tables present in a single database using Sqoop?

The command to check the list of all tables present in a single database using Sqoop is as follows-

Sqoop list-tables --connect jdbc:mysql://localhost/user

7. How are large objects handled in Sqoop?

Sqoop provides the capability to store large sized data into a single field based on the type of data. Sqoop supports the ability to store-

1)CLOB ‘s – Character Large Objects

2)BLOB’s –Binary Large Objects

Large objects in Sqoop are handled by importing the large objects into a file referred to as a “LobFile”, i.e. a Large Object File. The LobFile has the ability to store records of huge size, thus each record in the LobFile is a large object.

8. Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used?

Sqoop allows us to use free form SQL queries with the import command. The import command should be used with the -e or --query option to execute free form SQL queries. When using the -e or --query option with the import command, the --target-dir value must be specified.

9. Differentiate between Sqoop and distCP.

DistCP utility can be used to transfer data between clusters whereas Sqoop can be used to transfer data only between Hadoop and RDBMS.

10. What are the limitations of importing RDBMS tables into Hcatalog directly?

There is an option to import RDBMS tables into HCatalog directly by using the --hcatalog-database option with --hcatalog-table, but the limitation is that several arguments such as --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir are not supported.

11. Is it suggested to place the data transfer utility Sqoop on an edge node?

It is not suggested to place sqoop on an edge node or gateway node because the high data transfer volumes could risk the ability of hadoop services on the same node to communicate. Messages are the lifeblood of any hadoop service and high latency could result in the whole node being cut off from the hadoop cluster.

We have further categorized Hadoop Sqoop Interview Questions for Freshers and Experienced-

  • Hadoop Interview Questions and Answers for Freshers - Q.Nos- 4,5,6,9
  • Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1,2,3,6,7,8,10

1) Explain about the core components of Flume.

The core components of Flume are –

Event- The single log entry or unit of data that is transported.

Source- This is the component through which data enters Flume workflows.

Sink-It is responsible for transporting data to the desired destination.

Channel- It is the conduit between the Source and the Sink.

Agent- Any JVM that runs Flume.

Client- The component that transmits event to the source that operates with the agent.

2) Does Flume provide 100% reliability to the data flow?

Yes, Apache Flume provides end to end reliability because of its transactional approach in data flow.

3) How can Flume be used with HBase?

Apache Flume can be used with HBase using one of the two HBase sinks –

  • HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters and also the novel HBase IPC that was introduced in the version HBase 0.96.
  • AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBase sink as it can easily make non-blocking calls to HBase.

Working of the HBaseSink –

In HBaseSink, a Flume Event is converted into HBase Increments or Puts. Serializer implements the HBaseEventSerializer which is then instantiated when the sink starts. For every event, sink calls the initialize method in the serializer which then translates the Flume Event into HBase increments and puts to be sent to HBase cluster.

Working of the AsyncHBaseSink-

AsyncHBaseSink implements the AsyncHBaseEventSerializer. The initialize method is called only once by the sink when it starts. Sink invokes the setEvent method and then makes calls to the getIncrements and getActions methods just similar to HBase sink. When the sink stops, the cleanUp method is called by the serializer.

4) Explain about the different channel types in Flume. Which channel type is faster?

The 3 different built in channel types available in Flume are-

MEMORY Channel – Events are read from the source into memory and passed to the sink.

JDBC Channel – JDBC Channel stores the events in an embedded Derby database.

FILE Channel –File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.

MEMORY Channel is the fastest channel among the three however has the risk of data loss. The channel that you choose completely depends on the nature of the big data application and the value of each event.

5) Which is the reliable channel in Flume to ensure that there is no data loss?

FILE Channel is the most reliable channel among the 3 channels JDBC, FILE and MEMORY.

6) Explain about the replication and multiplexing selectors in Flume.

Channel Selectors are used to handle multiple channels. Based on the Flume header value, an event can be written just to a single channel or to multiple channels. If a channel selector is not specified to the source then by default it is the Replicating selector. Using the replicating selector, the same event is written to all the channels in the source’s channels list. Multiplexing channel selector is used when the application has to send different events to different channels.

7) How multi-hop agent can be setup in Flume?

Avro RPC Bridge mechanism is used to setup Multi-hop agent in Apache Flume.

8) Does Apache Flume provide support for third party plug-ins?

Yes. Apache Flume has a plug-in based architecture, so it can load data from external sources and transfer it to external destinations, and most data analysts make use of third party plug-ins with it.

9) Is it possible to leverage real time analysis on the big data collected by Flume directly? If yes, then explain how.

Data from Flume can be extracted, transformed and loaded in real-time into Apache Solr servers using MorphlineSolrSink.

10) Differentiate between FileSink and FileRollSink

The major difference between HDFS FileSink and FileRollSink is that HDFS File Sink writes the events into the Hadoop Distributed File System (HDFS) whereas File Roll Sink stores the events into the local file system.

Hadoop Flume Interview Questions and Answers for Freshers - Q.Nos- 1,2,4,5,6,10

Hadoop Flume Interview Questions and Answers for Experienced- Q.Nos- 3,7,8,9

Hadoop Zookeeper Interview Questions and Answers

1) Can Apache Kafka be used without Zookeeper?

It is not possible to use Apache Kafka without Zookeeper because if Zookeeper is down, Kafka cannot serve client requests.

2) Name a few companies that use Zookeeper.

Yahoo, Solr, Helprace, Neo4j, Rackspace

3) What is the role of Zookeeper in HBase architecture?

In HBase architecture, ZooKeeper is the monitoring server that provides different services like –tracking server failure and network partitions, maintaining the configuration information, establishing communication between the clients and region servers, usability of ephemeral nodes to identify the available servers in the cluster.

4) Explain about ZooKeeper in Kafka

Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. Zookeeper is used by Kafka to store various configurations and use them across the hadoop cluster in a distributed manner. To achieve distributed-ness, configurations are distributed and replicated throughout the leader and follower nodes in the ZooKeeper ensemble. We cannot connect directly to Kafka by bypassing ZooKeeper, because if ZooKeeper is down it will not be able to serve client requests.

5) Explain how Zookeeper works

ZooKeeper is referred to as the King of Coordination and distributed applications use ZooKeeper to store and facilitate important configuration information updates. ZooKeeper works by coordinating the processes of distributed applications. ZooKeeper is a robust replicated synchronization service with eventual consistency. A set of nodes is known as an ensemble and persisted data is distributed between multiple nodes.

3 or more independent servers collectively form a ZooKeeper cluster and elect a master. A client connects to any one of the servers and migrates if that particular node fails. The ensemble of ZooKeeper nodes is alive as long as the majority of nodes are working. The master node in ZooKeeper is dynamically selected by consensus within the ensemble, so if the master node fails, the role of the master migrates to another dynamically selected node. Writes are linear and reads are concurrent in ZooKeeper.

6) List some examples of Zookeeper use cases.

  • Found by Elastic uses Zookeeper comprehensively for resource allocation, leader election, high priority notifications and discovery. The entire Found service is built up of various systems that read and write to Zookeeper.
  • Apache Kafka that depends on ZooKeeper is used by LinkedIn
  • Storm that relies on ZooKeeper is used by popular companies like Groupon and Twitter.

7) How to use Apache Zookeeper command line interface?

ZooKeeper has a command line client support for interactive use. The command line interface of ZooKeeper is similar to the file and shell system of UNIX. Data in ZooKeeper is stored in a hierarchy of Znodes where each znode can contain data just similar to a file. Each znode can also have children just like directories in the UNIX file system.

Zookeeper-client command is used to launch the command line client. If the initial prompt is hidden by the log messages after entering the command, users can just hit ENTER to view the prompt.

8) What are the different types of Znodes?

There are 2 types of Znodes namely- Ephemeral and Sequential Znodes.

  • The Znodes that get destroyed as soon as the client that created it disconnects are referred to as Ephemeral Znodes.
  • Sequential Znode is one in which a sequential number, chosen by the ZooKeeper ensemble, is appended to the name the client assigns to the znode.

9) What are watches?

Client disconnection can be a troublesome problem, especially when we need to keep track of the state of Znodes at regular intervals. ZooKeeper has an event system referred to as a watch, which can be set on a Znode to trigger an event whenever the Znode is removed, altered or has new children created below it.
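
A minimal sketch of registering a watch through the ZooKeeper Java client (the connection string and znode path are illustrative):

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeWatchExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });
        // Registers a one-time watch on /config; the callback fires when the
        // znode is created, deleted or its data changes.
        zk.exists("/config", new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                System.out.println("Watch fired: " + event.getType() + " on " + event.getPath());
            }
        });
        Thread.sleep(60000); // keep the session alive long enough to observe an event
        zk.close();
    }
}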

10) What problems can be addressed by using Zookeeper?

In the development of distributed systems, creating own protocols for coordinating the hadoop cluster results in failure and frustration for the developers. The architecture of a distributed system can be prone to deadlocks, inconsistency and race conditions. This leads to various difficulties in making the hadoop cluster fast, reliable and scalable. To address all such problems, Apache ZooKeeper can be used as a coordination service to write correct distributed applications without having to reinvent the wheel from the beginning.

Hadoop ZooKeeper Interview Questions and Answers for Freshers - Q.Nos- 1,2,8,9

Hadoop ZooKeeper Interview Questions and Answers for Experienced- Q.Nos-3,4,5,6,7, 10

Hadoop Pig Interview Questions and Answers

1) What are different modes of execution in Apache Pig?

Apache Pig runs in 2 modes- one is the “Pig (Local Mode) Command Mode” and the other is the “Hadoop MapReduce (Java) Command Mode”. Local Mode requires access to only a single machine where all files are installed and executed on a local host whereas MapReduce requires accessing the Hadoop cluster.

2) Explain about co-group in Pig.

COGROUP operator in Pig is used to work with multiple tuples. COGROUP operator is applied on statements that contain or involve two or more relations. The COGROUP operator can be applied on up to 127 relations at a time. When using the COGROUP operator on two tables at once-Pig first groups both the tables and after that joins the two tables on the grouped columns.

We have further categorized Hadoop Pig Interview Questions for Freshers and Experienced-

  • Hadoop Interview Questions and Answers for Freshers - Q.No-1
  • Hadoop Interview Questions and Answers for Experienced - Q.No- 2

Hadoop Hive Interview Questions and Answers

1) Explain about the SMB Join in Hive.

In SMB join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table and then a merge sort join is performed. Sort Merge Bucket (SMB) join in hive is mainly used as there is no limit on file or partition or table join. SMB join can best be used when the tables are large. In SMB join the columns are bucketed and sorted using the join columns. All tables should have the same number of buckets in SMB join.

2) How can you connect an application, if you run Hive as a server?

When running Hive as a server, the application can be connected in one of the 3 ways-

ODBC Driver-This supports the ODBC protocol

JDBC Driver- This supports the JDBC protocol

Thrift Client- This client can be used to make calls to all hive commands using different programming languages like PHP, Python, Java, C++ and Ruby.
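
A minimal sketch of connecting to Hive running as a server (HiveServer2) over JDBC, assuming the Hive JDBC driver jar is on the classpath; the URL, credentials and query are illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // list the tables in the default database
            }
        }
    }
}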

3) What does the overwrite keyword denote in Hive load statement?

Overwrite keyword in Hive load statement deletes the contents of the target table and replaces them with the files referred by the file path i.e. the files that are referred by the file path will be added to the table when using the overwrite keyword.

4) What is SerDe in Hive? How can you write your own custom SerDe?

SerDe is a Serializer DeSerializer. Hive uses SerDe to read and write data from tables. Generally, users prefer to write a Deserializer instead of a SerDe as they want to read their own data format rather than writing to it. If the SerDe supports DDL i.e. basically SerDe with parameterized columns and different column types, the users can implement a Protocol based DynamicSerDe rather than writing the SerDe from scratch.

We have further categorized Hadoop Hive Interview Questions for Freshers and Experienced-

Hadoop Hive Interview Questions and Answers for Freshers- Q.Nos-3

Hadoop Hive Interview Questions and Answers for Experienced- Q.Nos-1,2,4

1)What are the stable versions of Hadoop?

Release 2.7.1 (stable)

Release 2.4.1

Release 1.2.1 (stable)

2) What is Apache Hadoop YARN?

YARN is a powerful and efficient feature rolled out as a part of Hadoop 2.0. YARN is a large scale distributed system for running big data applications.

3) Is YARN a replacement of Hadoop MapReduce?

YARN is not a replacement of Hadoop MapReduce but a more powerful and efficient technology that supports MapReduce; it is also referred to as Hadoop 2.0 or MapReduce 2.

4) What are the additional benefits YARN brings in to Hadoop?

  • Effective utilization of resources, as multiple applications can run in YARN, all sharing a common resource pool. In Hadoop MapReduce there are separate slots for Map and Reduce tasks, whereas in YARN there is no fixed slot. The same container can be used for Map and Reduce tasks, leading to better utilization.
  • YARN is backward compatible, so all existing MapReduce jobs can run on it.
  • Using YARN, one can even run applications that are not based on the MapReduce model.

5) How can native libraries be included in YARN jobs?

There are two ways to include native libraries in YARN jobs-

1) By setting the -Djava.library.path on the command line, but in this case there are chances that the native libraries might not be loaded correctly and there is a possibility of errors.

2) The better option to include native libraries is to set the LD_LIBRARY_PATH in the .bashrc file.

6) Explain the differences between Hadoop 1.x and Hadoop 2.x

  • In Hadoop 1.x, MapReduce is responsible for both processing and cluster management whereas in Hadoop 2.x processing is taken care of by other processing models and YARN is responsible for cluster management.
  • Hadoop 2.x scales better when compared to Hadoop 1.x with close to 10000 nodes per cluster.
  • Hadoop 1.x has single point of failure problem and whenever the NameNode fails it has to be recovered manually. However, in case of Hadoop 2.x StandBy NameNode overcomes the SPOF problem and whenever the NameNode fails it is configured for automatic recovery.
  • Hadoop 1.x works on the concept of slots whereas Hadoop 2.x works on the concept of containers and can also run generic tasks.

7) What are the core changes in Hadoop 2.0?

Hadoop 2.x provides an upgrade to Hadoop 1.x in terms of resource management, scheduling and the manner in which execution occurs. In Hadoop 2.x the cluster resource management capabilities work in isolation from the MapReduce specific programming logic. This helps Hadoop to share resources dynamically between multiple parallel processing frameworks like Impala and the core MapReduce component. Hadoop 2.x allows workable and fine grained resource configuration, leading to efficient and better cluster utilization so that the application can scale to process a larger number of jobs.

8) Differentiate between NFS, Hadoop NameNode and JournalNode.

HDFS is a write-once file system, so a user cannot update files once they exist; they can only be read. However, under certain scenarios in the enterprise environment, like file uploading, file downloading, file browsing or data streaming, it is not possible to achieve all of this using standard HDFS. This is where the distributed file system protocol Network File System (NFS) is used. NFS allows access to files on remote machines just like the way the local file system is accessed by applications.

Namenode is the heart of the HDFS file system that maintains the metadata and tracks where the file data is kept across the Hadoop cluster.

StandBy Nodes and Active Nodes communicate with a group of light weight nodes to keep their state synchronized. These are known as Journal Nodes.

9) What are the modules that constitute the Apache Hadoop 2.0 framework?

Hadoop 2.0 contains four important modules of which 3 are inherited from Hadoop 1.0 and a new module YARN is added to it.

1. Hadoop Common – This module consists of all the basic utilities and libraries that are required by the other modules.

2. HDFS- Hadoop Distributed file system that stores huge volumes of data on commodity machines across the cluster.

3. MapReduce- Java based programming model for data processing.

4. YARN- This is a new module introduced in Hadoop 2.0 for cluster resource management and job scheduling.

10) How is the distance between two nodes defined in Hadoop?

Measuring bandwidth is difficult in Hadoop, so the network is represented as a tree. The distance between two nodes in the tree plays a vital role in forming a Hadoop cluster and is defined by the network topology and the Java interface DNSToSwitchMapping. The distance is equal to the sum of the distances to the closest common ancestor of both the nodes. The method getDistance(Node node1, Node node2) is used to calculate the distance between two nodes, with the assumption that the distance from a node to its parent node is always 1.

We have further categorized Hadoop YARN Interview Questions for Freshers and Experienced-

  • Hadoop Interview Questions and Answers for Freshers - Q.Nos- 2,3,4,6,7,9
  • Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1,5,8,10

Hadoop Testing Interview Questions

1) How will you test data quality ?

The entire data that has been collected could be important, but all data is not equal, so it is necessary to first define where the data came from and how the data will be used and consumed. Data that will be consumed by vendors or customers within the business ecosystem should be checked for quality and needs to be cleaned. This can be done by applying stringent data quality rules and by inspecting different properties like conformity, perfection, repetition, reliability, validity, completeness of data, etc.

2) What are the challenges that you encounter when testing large datasets?

  • More data needs to be substantiated.
  • Testing large datasets requires automation.
  • Testing options across all platforms need to be defined.

3) What do you understand by MapReduce validation ?

This is the subsequent and most important step of the big data testing process. The Hadoop developer needs to verify the right implementation of the business logic on every hadoop cluster node and validate the data after executing it on all the nodes to determine -

  • Appropriate functioning of the MapReduce function.
  • Validate if rules for data segregation are implemented.
  • Pairing and creation of key-value pairs.

Candidates should not be afraid to ask questions to the interviewer. To ease this for hadoop job seekers, DeZyre has collated a few hadoop interview FAQs that every candidate should ask an interviewer during their next hadoop job interview-

1) What is the size of the biggest hadoop cluster a company X operates?

Asking this question helps a hadoop job seeker understand the hadoop maturity curve at a company. Based on the answer of the interviewer, a candidate can judge how much an organization invests in Hadoop and their enthusiasm to buy big data products from various vendors. The candidate can also get an idea of the hiring needs of the company based on their hadoop infrastructure.

2) For what kind of big data problems, did the organization choose to use Hadoop?

Asking this question to the interviewer shows the candidates keen interest in understanding the reason for hadoop implementation from a business perspective. This question gives the impression to the interviewer that the candidate is not merely interested in the hadoop developer job role but is also interested in the growth of the company.

3) Based on the answer to question no 1, the candidate can ask the interviewer why the hadoop infrastructure is configured in that particular way, why the company chose to use the selected big data tools and how workloads are constructed in the hadoop environment.

Asking this question to the interviewer gives the impression that you are not just interested in maintaining the big data system and developing products around it but are also seriously thoughtful on how the infrastructure can be improved to help business growth and make cost savings.

4) What kind of data the organization works with or what are the HDFS file formats the company uses?

The question gives the candidate an idea on the kind of big data he or she will be handling if selected for the hadoop developer job role. Based on the data, it gives an idea on the kind of analysis they will be required to perform on the data.

5) What is the most complex problem the company is trying to solve using Apache Hadoop?

Asking this question helps the candidate know more about the upcoming projects he or she might have to work on and the challenges around them. Knowing this beforehand helps the interviewee prepare for his or her areas of weakness.

6) Will I get an opportunity to attend big data conferences? Or will the organization incur any costs involved in taking advanced hadoop or big data certification?

This is a very important question that you should be asking the interviewer. It helps a candidate understand whether the prospective hiring manager is interested and supportive when it comes to the professional development of the employee.

1) If you are an experienced hadoop professional then you are likely to be asked questions like –

  • The number of nodes you have worked with in a cluster.
  • Which hadoop distribution have you used in your recent project.
  • Your experience on working with special configurations like High Availability.
  • The data volume you have worked with in your most recent project.
  • What are the various tools you used in the big data and hadoop projects you have worked on?

Your answer to these interview questions will help the interviewer understand your expertise in Hadoop based on the size of the hadoop cluster and number of nodes. Based on the highest volume of data you have handled in your previous projects, interviewer can assess your overall experience in debugging and troubleshooting issues involving huge hadoop clusters.

The number of tools you have worked with help an interviewer judge that you are aware of the overall hadoop ecosystem and not just MapReduce. To be selected, it all depends on how well you communicate the answers to all these questions.

2) What are the challenges that you faced when implementing hadoop projects?

Interviewers are interested to know more about the various issues you have encountered in the past when working with hadoop clusters and understand how you addressed them. The way you answer this question tells a lot about your expertise in troubleshooting and debugging hadoop clusters. The more issues you have encountered, the more probable it is that you have become an expert in that area of Hadoop. Ensure that you list out all the issues that you have troubleshot.

3) How were you involved in data modelling, data ingestion, data transformation and data aggregation?

You are likely to be involved in one or more phases when working with big data in a hadoop environment. The answer to this question helps the interviewer understand what kind of tools you are familiar with. If you answer that your focus was mainly on data ingestion, they can expect you to be well-versed with Sqoop and Flume, and if you answer that you were involved in data analysis and data transformation, it gives the interviewer an impression that you have expertise in using Pig and Hive.

4) What is your favourite tool in the hadoop ecosystem?

The answer to this question will help the interviewer know more about the big data tools that you are well-versed with and are interested in working with. If you show affinity towards a particular tool, the probability that you will be deployed to work on that particular tool is higher. If you say that you have a good knowledge of all the popular big data tools like Pig, Hive, HBase, Sqoop and Flume, it shows that you have knowledge about the hadoop ecosystem as a whole.

5) In your previous project, did you maintain the hadoop cluster in-house or did you use hadoop in the cloud?

Most of the organizations still do not have the budget to maintain hadoop cluster in-house and they make use of hadoop in the cloud from various vendors like Amazon, Microsoft, Google, etc. Interviewer gets to know about your familiarity with using hadoop in the cloud because if the company does not have an in-house implementation then hiring a candidate who has knowledge about using hadoop in the cloud is worth it.

Big Data Interview Questions asked at Top Tech Companies

1) Write a MapReduce program to add all the elements in a file. (Hadoop Developer Interview Question asked at KPIT)
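
A minimal sketch of one possible answer to question 1, assuming the input file contains whitespace-separated integers (class names and paths are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SumElements {
    public static class SumMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final Text SUM_KEY = new Text("sum");
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit every number under a single key so one reducer can total them.
            for (String token : line.toString().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    context.write(SUM_KEY, new LongWritable(Long.parseLong(token)));
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sum-elements");
        job.setJarByClass(SumElements.class);
        job.setMapperClass(SumMapper.class);
        job.setCombinerClass(SumReducer.class); // summing is associative, so the reducer doubles as a combiner
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}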

2) What is the difference between hashset and hashmap ? (Big Data Interview Question asked at Wipro)

3) Write a Hive program to find the number of employees department wise in an organization. ( Hadoop Developer Interview Question asked at Tripod Technologies)

4) How will you read a CSV file of 10GB and store it in the database as it is in just a few seconds? (Hadoop Interview Question asked at Deutsche Bank)

5) How will a file of 100MB be stored in Hadoop ? (Hadoop Interview Question asked at Deutsche Bank)

Source: Contents are provided by Technicalsymposium Google Group Members.
Disclaimer: All the above contents are provided by technicalsymposium.com Google Group members.
Further, this content is not intended to be used for commercial purpose. Technicalsymposium.com is not liable/responsible for any copyright issues.
