Big Data Hadoop Interview Questions and Answers

1. What are the responsibilities of a data analyst?

The responsibilities of a data analyst include:

Provide support for all data analysis and coordinate with customers and staff

Resolve business-related issues for clients and perform audits on data

Analyze results and interpret data using statistical techniques and provide ongoing reports

Prioritize business and information needs and work closely with management

Identify new processes or areas for improvement

Analyze, identify and interpret trends or patterns in complex data sets

Acquire data from primary or secondary data sources and maintain databases/data systems

Filter and “clean” data, and review computer reports

Determine performance indicators that help locate and correct code problems

Secure the database by developing an access system that determines each user's level of access

2. What is required to become a data analyst?

To become a data analyst, you need:

Robust knowledge of reporting packages (Business Objects), programming languages and related technologies (XML, JavaScript, or ETL frameworks), and databases (SQL, SQLite, etc.)

Strong ability to collect, organize, analyze, and disseminate big data accurately

Technical knowledge of database design, data models, data mining, and segmentation techniques

Strong knowledge of statistical packages for analyzing large datasets (SAS, Excel, SPSS, etc.)

3. What are the various steps in an analytics project?

The various steps in an analytics project include:

Problem definition

Data exploration

Data preparation

Modelling

Validation of data

Implementation and tracking

4. What is data cleansing?

Data cleaning, also called data cleansing, deals with identifying and removing errors and inconsistencies from data in order to improve the quality of the data.

5. What are some best practices for data cleaning?

Sort data by different attributes

For large datasets, cleanse stepwise and improve the data with each step until you reach good data quality

For large datasets, break them into smaller subsets. Working with less data will increase your iteration speed

To handle common cleansing tasks, create a set of utility functions/tools/scripts. These might include remapping values based on a CSV file or SQL database, regex search-and-replace, or blanking out all values that don’t match a regex

If you have issues with data cleanliness, order them by estimated frequency and attack the most common problems first

Analyze the summary statistics for each column (standard deviation, mean, number of missing values)

Keep track of every data cleaning operation, so you can alter changes or remove operations if required
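
As an illustration of the utility-function and summary-statistics practices above, here is a minimal sketch using only the Python standard library. The function names `blank_non_matching` and `summarize`, and the sample `raw_ages` column, are hypothetical and chosen only for this example.

```python
import re
import statistics

def blank_non_matching(values, pattern):
    """Replace every value that does not fully match the regex with None."""
    compiled = re.compile(pattern)
    return [v if v is not None and compiled.fullmatch(v) else None for v in values]

def summarize(values):
    """Summary statistics for a numeric column that may contain None."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),  # needs at least two present values
        "missing": len(values) - len(present),
    }

# Hypothetical raw column: ages stored as text, with some junk entries.
raw_ages = ["34", "n/a", "41", "", "29", "abc"]
cleaned = blank_non_matching(raw_ages, r"\d+")       # junk becomes None
ages = [int(v) if v is not None else None for v in cleaned]
summary = summarize(ages)
print(cleaned)
print(summary)
```

Keeping each operation as a small named function also makes it easier to log which cleaning steps were applied and to undo one if needed.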

6. What is logistic regression?

Logistic regression is a statistical method for examining a dataset in which one or more independent variables determine an outcome. The outcome is categorical, typically binary, and the model estimates the probability of each outcome.
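
To make this concrete, the sketch below shows how a fitted logistic regression model would turn independent variables into a probability by passing a weighted sum through the logistic (sigmoid) function. The `weights` and `bias` values are made up for illustration; in practice they would be estimated from data.

```python
import math

def predict_probability(features, weights, bias):
    """Probability of the positive outcome under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients; a real model would learn these from training data.
weights = [0.8, -0.5]
bias = -1.0
p = predict_probability([2.0, 1.0], weights, bias)
predicted_class = 1 if p >= 0.5 else 0
print(p, predicted_class)
```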

7. What are some of the best tools for data analysis?

Tableau

RapidMiner

OpenRefine

KNIME

Google Search Operators

Solver

NodeXL

Wolfram Alpha

Google Fusion Tables

8. What is the difference between data mining and data profiling?

The difference between data mining and data profiling is as follows:
Data profiling: It focuses on the instance-level analysis of individual attributes. It gives information on attribute properties such as value range, discrete values and their frequency, occurrence of null values, data type, length, etc.
Data mining: It focuses on cluster analysis, detection of unusual records, dependencies, sequence discovery, relationships holding between several attributes, etc.
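
A minimal sketch of what profiling a single attribute might look like, assuming the column is available as a Python list with `None` marking nulls. The function name `profile_column` and the sample values are hypothetical.

```python
from collections import Counter

def profile_column(values):
    """Collect basic profile information for one attribute."""
    present = [v for v in values if v is not None]
    return {
        "null_count": len(values) - len(present),
        "data_types": sorted({type(v).__name__ for v in present}),
        "min": min(present),
        "max": max(present),
        "frequencies": Counter(present),  # discrete values and how often each occurs
    }

profile = profile_column([3, 7, None, 3, 9, None, 3])
print(profile)
```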

9. What are some common problems faced by data analysts?

Some of the common problems faced by data analysts are:

Common misspellings

Duplicate entries

Missing values

Illegal values

Varying value representations

Identifying overlapping data

10. What is the name of the framework developed by Apache for processing large data sets for an application in a distributed computing environment?

Hadoop, together with its MapReduce programming model, is the framework developed by Apache for processing large data sets for an application in a distributed computing environment.
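
The MapReduce idea can be sketched as a word count in plain Python. This is a single-process simulation for illustration only; it does not use the Hadoop API, and the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions made for this example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data big ideas", "data pipelines"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

In a real cluster the map and reduce steps would run in parallel on different nodes, with the framework handling the shuffle between them.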

11. What missing-data patterns are generally observed?

The missing-data patterns that are generally observed are:

Missing completely at random

Missing at random

Missingness that depends on the missing value itself

Missingness that depends on an unobserved input variable

12. What is time series analysis?

Time series analysis can be done in two domains: the frequency domain and the time domain. In time series analysis, the output of a particular process can be forecast by analyzing previous data with the help of methods such as exponential smoothing and log-linear regression.
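
As a small example of one such method, here is a sketch of simple exponential smoothing. The `sales` series and the smoothing factor `alpha=0.5` are invented for illustration; the last smoothed value is used as a naive one-step-ahead forecast.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: each smoothed value blends the new
    observation with the previous smoothed value."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [10.0, 12.0, 11.0, 15.0]   # hypothetical observations
smoothed = exponential_smoothing(sales, alpha=0.5)
forecast = smoothed[-1]            # naive forecast for the next period
print(smoothed, forecast)
```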

13. What is correlogram analysis?

Correlogram analysis is a common form of spatial analysis in geography. It consists of a series of estimated autocorrelation coefficients calculated for different spatial relationships. It can be used to construct a correlogram for distance-based data, when the raw data is expressed as distances rather than values at individual points.
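
A simplified sketch of the idea, under the assumption that values are sampled at equally spaced points along a line, so that the "spatial relationship" is just the lag (number of steps) between points. The `elevations` data is hypothetical, and real spatial correlograms typically use distance classes in two dimensions rather than this one-dimensional lag.

```python
def autocorrelation(values, lag):
    """Estimated autocorrelation coefficient between points `lag` steps apart."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance

def correlogram(values, max_lag):
    """Series of autocorrelation coefficients for lags 0..max_lag."""
    return [autocorrelation(values, lag) for lag in range(max_lag + 1)]

elevations = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical samples along a transect
coefficients = correlogram(elevations, max_lag=2)
print(coefficients)
```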

14. What is a hash table?

In computing, a hash table is a map of keys to values. It is a data structure used to implement an associative array. It uses a hash function to compute an index into an array of slots, from which the desired value can be fetched.

15. What are hash table collisions? How are they handled?

A hash table collision happens when two different keys hash to the same slot. Two items cannot be stored directly in the same array slot.
There are many techniques for resolving hash table collisions; two are listed here:

Separate chaining:
It uses a secondary data structure (such as a linked list) to store multiple items that hash to the same slot.

Open addressing:
It searches for other slots using a probing sequence (for example, a second hash function) and stores the item in the first empty slot that is found.
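
Separate chaining can be sketched as follows. The class name `ChainedHashTable` is hypothetical, and Python lists stand in for the per-slot chains; the example uses a single slot purely to force collisions.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by keeping a list per slot."""

    def __init__(self, slot_count=8):
        self.slots = [[] for _ in range(slot_count)]

    def _index(self, key):
        # The hash function maps a key to a slot index.
        return hash(key) % len(self.slots)

    def put(self, key, value):
        bucket = self.slots[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))        # colliding keys share the bucket

    def get(self, key):
        for existing_key, value in self.slots[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

table = ChainedHashTable(slot_count=1)  # one slot, so every key collides
table.put("apple", 1)
table.put("banana", 2)
table.put("apple", 3)
print(table.get("apple"), table.get("banana"))
```

Open addressing would instead keep one item per slot and probe for another free slot on collision.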

16. What is imputation? What are the different types of imputation techniques?

During imputation, we replace missing data with substituted values. The types of imputation techniques are:

Single imputation

Hot-deck imputation: A missing value is imputed from a randomly selected similar record (the name dates from the era of punch cards)

Cold-deck imputation: It works like hot-deck imputation, but selects donors from another dataset

Mean imputation: It involves replacing a missing value with the mean of that variable for all other cases

Regression imputation: It involves replacing a missing value with the value predicted for that variable from other variables

Stochastic regression imputation: It is the same as regression imputation, but it adds a random residual term, reflecting the regression variance, to the predicted value

Multiple imputation

Unlike single imputation, multiple imputation estimates the missing values multiple times
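
As a minimal example of one single-imputation technique, the sketch below performs mean imputation on a column where `None` marks a missing value. The function name `mean_impute` and the `incomes` data are hypothetical.

```python
import statistics

def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

incomes = [40.0, None, 60.0, 50.0, None]   # hypothetical column
imputed = mean_impute(incomes)
print(imputed)
```

Because every missing entry receives the same value, mean imputation understates the variability of the data, which is one motivation for multiple imputation.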

17. What is an n-gram?

N-gram: An n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram model is a type of probabilistic language model that predicts the next item in such a sequence in the form of an (n-1)-order Markov model.
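
A short sketch of both ideas: extracting n-grams from a token list, and using bigram counts (n = 2) to estimate the probability of the next word given the previous one. The function names and the sample sentence are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous runs of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def next_word_probabilities(tokens, context):
    """Bigram model: estimate P(next word | context word) from counts."""
    followers = Counter(b for a, b in ngrams(tokens, 2) if a == context)
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}

tokens = "the cat sat on the mat the cat ran".split()
bigrams = ngrams(tokens, 2)
probabilities = next_word_probabilities(tokens, "the")
print(bigrams[:3])
print(probabilities)
```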

18. What are the criteria for a good data model?

The criteria for a good data model include:

It can be easily consumed.

Large data changes in a good model should be scalable.

It should provide predictable performance.

A good model can adapt to changes in requirements.

19. Which imputation method is more favourable?

Although single imputation is widely used, it does not reflect the uncertainty created by data missing at random. So multiple imputation is more favourable than single imputation when data are missing at random.

20. What is clustering? What are the properties of clustering algorithms?

Clustering is a classification method that is applied to data. A clustering algorithm divides a data set into natural groups or clusters.

Properties of clustering algorithms are:

Hierarchical or flat.

Iterative.

Hard or soft.

Disjunctive.
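
To illustrate several of these properties at once, here is a minimal one-dimensional k-means sketch: it is flat (no hierarchy), iterative, and hard (each point belongs to exactly one cluster). The function name `k_means_1d`, the sample points, and the starting centroids are assumptions made for this example.

```python
def k_means_1d(points, centroids, iterations=10):
    """Flat, iterative, hard clustering of numbers around k centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```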
