Data Science

Data Science Interview Questions

Data science can be explained as a collection of machine learning algorithms, methods, tools, etc., intended to find useful patterns of information in the huge amounts of raw data available. It has immense applications in AI, statistics, prediction, medical fields, etc.

Differences between Data Science and Big Data

  • Data Science is a collection of methods, tools, and algorithms to manage and retrieve information from raw data, whereas Big Data refers to huge data sets collected from various sources that cannot be stored or processed easily.
  • Data Science has applications in speech and voice recognition, the financial sector, web research, etc., whereas Big Data is popular in the fields of communications, research, medical applications, etc.
  • Data Science retrieves patterns from raw data using machine learning algorithms, whereas Big Data helps in solving data storage issues and in handling huge amounts of data.
  • Important languages and tools in Data Science are Python, R, SQL, etc., whereas Big Data uses tools such as Hadoop, Spark, Hive, etc.

 

We have various methods and standards to check the quality of data, and some of them are

  • Data completeness
  • Data consistency
  • Data uniqueness
  • Data integrity
  • Data accuracy
  • Data conformity

As the name suggests, supervised learning needs a supervisor while the machine learns from the dataset.

In supervised learning, we provide a sample data set, called the training data, to the machine before we use it on the actual data.

Examples of supervised learning are signature recognition, speech recognition, face detection, etc.
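
As a minimal illustration (a sketch using scikit-learn with made-up data, not a specific recommended setup), a classifier is first fit on labeled training data and then used on new data:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Labeled training data: features and their known labels (the "supervisor")
X_train = np.array([[1], [2], [3], [10], [11], [12]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = DecisionTreeClassifier()
model.fit(X_train, y_train)            # learn from the training data

print(model.predict([[2.5], [10.5]]))  # predict labels for unseen data -> [0 1]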

Unsupervised learning is similar to the working of a human brain. Unlike supervised learning, unsupervised learning doesn't have training data, so the machine has to learn the patterns from the actual data.

In simple terms, in unsupervised learning the machine has to learn from the actual data without a supervisor (training dataset).

Missing data is one of the major hurdles that has to be solved in data science. In general, there are two methods for dealing with missing data.

1. Debugging methods: these include the data cleaning process, which checks the data quality and takes the necessary steps to improve it. Some of the important debugging methods are

  1. Searching the list of values
  2. Filtering questions
  3. Check for logical consistencies
  4. Check the level of representativeness

2. Imputation method: in this method, we try to replace the missing values in the dataset by estimating valid values and answers (a small sketch of mean imputation follows this list). There are mostly three types of imputation methods

  1. Random imputation
  2. Hot deck imputation
  3. Imputation of the mean 
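
A minimal sketch of imputation of the mean using pandas (the column name "age" and its values are made up for illustration):

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, 30, np.nan, 40, np.nan]})

# Replace missing values with the column mean (imputation of the mean)
df["age"] = df["age"].fillna(df["age"].mean())
print(df)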

Hadoop is not a programming language; it is an open-source processing framework that helps manage the processing and storage of massive amounts of data across pooled (clustered) systems.

Apache Hadoop is a collection of open-source software and utility programs that makes it possible to use different computer systems in a network to solve a complicated problem that needs a massive amount of data and processing.

Apache Hadoop also provides a framework for distributed storage and processing using the programming model named MapReduce.

Hadoop is sometimes expanded as High Availability Distributed Object Oriented Platform.

This is a common, slightly subjective, and somewhat confusing question that can be asked in an interview. Most big companies consider good data more important, since we cannot build a good model without enough good data.

The answer to this question depends on your personal experience, and on the specific case if they provide an example.

It is an important command in the Hadoop system. fsck stands for File System Check, and it helps us check for errors, such as missing or corrupt blocks, in the file system. It also generates a report on the health of the Hadoop Distributed File System.

A wide data format is a way of recording data where each row is unique and has many columns for different attributes. In the wide format, if we have an entity with many attributes, each attribute is written in a different column of the single row for that entity. The wide format therefore has a large number of columns per row, and categorical data can be grouped into separate columns.

A long data format is a way of recording data that has only a limited number of columns for each row (entity). In this model, the row (entity) is not unique; it is repeated for each different attribute of that entity.

Wide and long data format
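
A small pandas sketch of the same data in wide and long form (the column names are hypothetical):

import pandas as pd

# Wide format: one row per student, one column per subject
wide = pd.DataFrame({"student": ["A", "B"], "math": [90, 75], "science": [85, 80]})

# Long format: the student is repeated once per attribute
long = wide.melt(id_vars="student", var_name="subject", value_name="score")
print(long)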

 

Interpolation is a method of estimating data points that are not given but lie within the range of the given data set.

It can be defined as predicting a data point that lies between known data points, i.e., calculating a function or data value based on the other data values in a series.

Unlike interpolation, extrapolation estimates missing data points that lie beyond the range of the given data set.

It is like predicting values that fall outside the data set. The quality of an extrapolated value depends on the method we choose to predict it.
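
A minimal NumPy sketch with made-up values: np.interp estimates a value between known points, while fitting a line with np.polyfit is one simple way to extrapolate beyond them:

import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([10, 20, 30, 40])

# Interpolation: estimate y at x = 2.5, which lies inside the data range
print(np.interp(2.5, x, y))            # 25.0

# Extrapolation: fit a straight line and evaluate it beyond the data range
slope, intercept = np.polyfit(x, y, 1)
print(slope * 6 + intercept)           # about 60.0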

In data science, the quantity and quality of data needed for a good output depend on different factors, such as

  • The method we use to calculate the output.
  • How much prediction accuracy we need, among other factors.

The expected value is the average outcome we would get over a huge number of trials or predictions. It is a theoretical, long-run average value.
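
For example, the expected value of a fair six-sided die can be computed directly from its outcomes and their probabilities (a simple sketch):

import numpy as np

outcomes = np.arange(1, 7)            # 1, 2, ..., 6
probabilities = np.full(6, 1 / 6)     # each outcome is equally likely

expected_value = np.sum(outcomes * probabilities)
print(expected_value)                 # 3.5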

HDFS supports only exclusive writes, which means the file system accepts input only from the first user who accesses the file, even if the second user arrives only microseconds later. The second user's input values will be rejected.

Power analysis is a calculation that helps you find or decide the minimum sample size needed for your research or study, given a significance level, an expected effect size, and the desired statistical power.
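
A hedged sketch using statsmodels (the effect size, significance level, and power below are arbitrary example inputs):

from statsmodels.stats.power import TTestIndPower

# Minimum sample size per group for a two-sample t-test
analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n))   # roughly 64 observations per group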

The normal distribution is also called the Gaussian distribution. It can be defined as a probability distribution that is symmetric about the mean. It shows that data near the mean occurs more frequently than data far from the mean.

Linear regression is used to predict the value of a variable from the values of other variables in the data set. It can be defined as a linear statistical method for finding the relationship between two variables in a dataset.

The value we want to calculate or predict is called the dependent variable, and the values we use for the prediction are called the independent variables.

Linear regression uses a straight line to represent the relationship between the variables.
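
A minimal scikit-learn sketch with made-up values, where y is the dependent variable and x the independent one:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4]])    # independent variable
y = np.array([3, 5, 7, 9])            # dependent variable (here y = 2x + 1)

model = LinearRegression().fit(x, y)
print(model.coef_, model.intercept_)  # slope ~2, intercept ~1
print(model.predict([[5]]))           # ~11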

 

For checking the elements, we first create two lists, convert the first list to a pandas Series, and then use the isin() function to check which elements of list one are present in list two.
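
A small pandas sketch of this check (the list names and values are placeholders):

import pandas as pd

list_one = [1, 2, 3, 4]
list_two = [2, 4, 6]

# For each element of list one, check whether it appears in list two
print(pd.Series(list_one).isin(list_two))   # False, True, False, True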

The major difference between KNN and K-means clustering is that,

KNN is a supervised learning algorithm that uses a labeled training dataset to train the algorithm to find the patterns in the data; the whole process is done under supervision. Whereas,

K-means clustering is an unsupervised learning algorithm that does not have a training dataset; the algorithm has to find the patterns from the raw data itself, similar to the working of the human brain.
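
A brief sketch of the contrast (made-up data): KNN needs labels to fit, K-means does not:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1], [2], [9], [10]])
y = np.array([0, 0, 1, 1])                            # labels, used only by KNN

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)   # supervised: needs y
print(knn.predict([[8]]))                             # [1]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # unsupervised: no y
print(kmeans.labels_)                                 # cluster assignments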

 

By using the function pd.concat() we can stack two sequences (Series). If we need to merge them horizontally, we set axis=1. For example, suppose we have two Series s1 and s2, then

result = pd.concat([s1, s2], axis=1)
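
A complete, runnable version of the same idea (the values in s1 and s2 are just examples):

import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])

stacked_vertically = pd.concat([s1, s2])            # one longer series
stacked_horizontally = pd.concat([s1, s2], axis=1)  # two columns side by side
print(stacked_horizontally)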

In pandas, the function to_datetime() is used to convert date strings into a datetime series.
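
For example (the date strings are made up):

import pandas as pd

dates = pd.Series(["2021-01-01", "2021-02-15", "2021-03-30"])
print(pd.to_datetime(dates))   # dtype becomes datetime64[ns]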

Python and R are programming languages that can be used in data science. Both have a wide range of functions and libraries that are good for working with data. Some of the differences between the two languages are

Python and R differences

  • Python is used for broad application-level purposes such as web development and data analysis, whereas R is mainly used for statistical modeling.
  • Python is mainly used by data scientists, programmers, and data engineers, whereas R is mainly used by statisticians, data engineers, and data scientists.
  • Python is simple enough to be used by anyone from beginners to expert-level engineers, whereas R can be used even by people with little programming or coding knowledge.
  • Python packages are distributed through PyPI, whereas R packages are distributed through CRAN.
  • Python has many visualization tools such as Matplotlib, Bokeh, and Seaborn, whereas R uses visualization packages such as ggplot2, plotly, and ggiraph.

 

An ROC curve stands for Receiver Operating Characteristic curve. It is a graph that shows the performance of a classification model at different threshold values. This graph has two parameters, which are

  1. True Positive Rate (TPR)
  2. False Positive Rate (FPR)

TPR can be calculated as TP / (TP + FN) and FPR can be calculated as FP / (FP + TN), where

  • TP = True positive
  • TN = True negative
  • FP = false positive
  • FN = False negative

AUC stands for Area Under the ROC Curve. It provides an aggregate measure of performance across all classification thresholds, summarizing the two-dimensional area under the ROC curve as a single value. Two closely related measures are precision and recall (a small scikit-learn sketch follows the list below), calculated as

P = TP / (TP + FP) and R = TP / (TP + FN), where

  • TP = True positive
  • TN = True negative
  • FP = false positive
  • FN = False negative
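
A minimal scikit-learn sketch with made-up labels and scores, computing the ROC curve points and the AUC:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr, tpr)
print(roc_auc_score(y_true, y_scores))   # 0.75 for this toy example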

For creating a Series from a list of values, we can use the pandas function Series().
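
For example:

import pandas as pd

values = [10, 20, 30]
s = pd.Series(values)   # creates a pandas Series from the list
print(s)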

Bias is the amount by which the predicted value of an algorithm (model) differs from the actual value. Bias happens due to over-simplification of the model. High bias leads to a phenomenon called underfitting.

Underfitting is a phenomenon caused by high bias in models, which gives poor results for both the test data and the training data set.

We can define variance as the change that happens in the results when we use different training data sets. Variance is caused by over-complexity of the predictive model.

Overfitting is another problem that happens in models, where the model gives correct output for the training data but gives poor results when it is loaded with test data.

The confusion matrix is a table used to check the performance of a supervised classification algorithm. By using a confusion matrix we are able to check the errors in the prediction model and also the types of errors. A confusion matrix is also called an error matrix.
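
A minimal sketch using scikit-learn with made-up labels:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# [[2 1]
#  [1 2]]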

Selection bias is a kind of bias error that happens while we select the data for the research. It occurs when the research data is not selected at random, and it can affect the prediction outcome. It is also called the selection effect, and it comes in different types, such as

  • Sampling bias
  • Time interval
  • Data
  • Attrition

 

A Markov chain is a systematic way of generating random sample values where the probability of the next value depends only on the last value of the series. The Markov model was introduced by Andrey Markov. Data scientists use Markov chain models to predict outcomes in certain cases.
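
A toy simulation of a two-state Markov chain (the states and transition probabilities are invented for illustration):

import numpy as np

states = ["sunny", "rainy"]
# transition[i][j] = probability of moving from state i to state j
transition = np.array([[0.8, 0.2],
                       [0.4, 0.6]])

rng = np.random.default_rng(0)
current = 0                      # start in "sunny"
chain = [states[current]]
for _ in range(10):
    # the next state depends only on the current state
    current = rng.choice(2, p=transition[current])
    chain.append(states[current])
print(chain)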

True Positive Rate, or TPR, is the rate or probability that an actual positive value will be tested as positive. TPR is the ratio of true positives to the sum of true positives and false negatives. It can be calculated as

TPR = TP / (TP + FN)

False Positive Rate, or FPR, is the probability of a false trigger, which means it is the probability of showing a result as positive when the result is actually negative. The False Positive Rate can be defined as the ratio of false positives to the sum of false positives and true negatives. It can be calculated as

FPR = FP / (FP + TN)

The R programming language offers a huge number of built-in functions and packages that help with visualizing data, such as ggplot2, leaflet, and lattice. Using R we can develop almost any kind of graph, which helps in exploratory data analysis. R supports a wider range of graphical requirements than most other languages.

SVM stands for Support Vector Machine, a supervised machine learning algorithm used for classification. It can be used for both classification and regression problems and is very popular because of its high accuracy and low computational cost. SVM builds a hyperplane that separates the classes (a small sketch follows the kernel list below).

Some of the kernels used in SVM are 

  • Polynomial Kernel
  • Gaussian 
  • Laplace RBF 
  • Sigmoid 
  • Hyperbolic Kernel
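
A minimal scikit-learn sketch with made-up data; the kernel parameter selects one of the kernels listed above (here "rbf"):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [4, 4], [5, 5]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="rbf")          # other options include "linear", "poly", "sigmoid"
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))   # expected: [0 1]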

Deep learning is a branch of machine learning that creates algorithms that work in a way similar to the human nervous system and to how the human brain gains knowledge from different situations. Deep learning involves neural networks that work on a huge amount of data to find patterns. Practical applications of deep learning are face recognition, virtual assistants, self-driving cars, etc.

A/B testing is an optimization method to find out how a change in some variable will affect the users or the audience and how they react to that change.

A/B testing is usually done for web pages: if we need a change on one webpage, that change is first shown only to some users to check their response. Then, based on their response, we apply the change permanently for all users and all pages of the website.
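
A hedged sketch of comparing two conversion rates with a two-proportion z-test from statsmodels (the counts below are invented):

from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 150]     # conversions in variant A and variant B
visitors = [2400, 2500]      # visitors shown variant A and variant B

stat, p_value = proportions_ztest(conversions, visitors)
print(p_value)               # a small p-value suggests the variants differ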

The Box-Cox transformation is a method used to transform a non-normal dependent variable into a normal shape. With this transformation, we can change a response variable so that the data meets specific requirements.
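
A short SciPy sketch (the skewed sample values are made up); note that Box-Cox requires strictly positive data:

import numpy as np
from scipy.stats import boxcox

data = np.array([1.0, 2.0, 2.5, 3.0, 10.0, 50.0])   # right-skewed, positive values

transformed, fitted_lambda = boxcox(data)
print(fitted_lambda)      # the lambda chosen by maximum likelihood
print(transformed)        # more symmetric, closer-to-normal values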

Suppose we have a huge number of dimensions, or details, collected for a prediction; the difficulty of choosing the correct dimensions from such a huge number of unwanted ones is called the curse of dimensionality.

In simple words, when a dataset has a huge number of columns that are not required, extracting the right columns needs a huge effort.

The pickle module in Python is used to serialize and deserialize objects. It can convert Python objects like lists, dicts, etc. into byte streams; this is also called marshalling or flattening.

We can also convert these byte streams back into Python objects, which is referred to as unpickling. Pickling helps us store an object on disk.
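
For example (the file name and data are placeholders):

import pickle

data = {"name": "model_v1", "scores": [0.91, 0.88]}

with open("data.pkl", "wb") as f:
    pickle.dump(data, f)          # pickling: object -> byte stream on disk

with open("data.pkl", "rb") as f:
    restored = pickle.load(f)     # unpickling: byte stream -> Python object
print(restored)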

There are several joins that can be used on tables (a small pandas sketch of the common join types follows this list), which are

  1. Inner Join
  2. Left Join
  3. Outer Join
  4. Full Join
  5. Self Join
  6. Cartesian Join
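
The same ideas can be sketched in pandas with merge(), where the how argument picks the join type (the table names and columns below are invented):

import pandas as pd

customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cara"]})
orders = pd.DataFrame({"id": [2, 3, 4], "amount": [250, 100, 300]})

print(pd.merge(customers, orders, on="id", how="inner"))   # only matching ids
print(pd.merge(customers, orders, on="id", how="left"))    # all customers
print(pd.merge(customers, orders, on="id", how="outer"))   # all rows from both
print(customers.merge(orders, how="cross"))                # Cartesian join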

The DELETE command is used to delete a number of rows from a table; it is used with a WHERE clause to select which rows get deleted. The TRUNCATE command, on the other hand, is used to delete all the rows from a table. A DELETE can be rolled back, but a TRUNCATE cannot.

Some of the clauses that we can use with SQL are

  1. WHERE
  2. GROUP BY
  3. ORDER BY
  4. USING

A foreign key is a key in a DBMS table that helps to make a link between two tables. It can be defined as a special key that belongs to one table but acts as the primary key of another table.

The table where the foreign key resides is called the child table, and the table in which the foreign key is the primary key is called the parent table.

Data integrity can be defined as the process or concept that helps to ensure the consistency, accuracy, and reliability of data that is to be stored in a database. Data integrity ensures the data quality and helps to make good predictions using that data. 

SQL database systems are relational database management systems (RDBMS). In RDBMS systems, the data is structured, which means the data is organized in table format (rows and columns).

NoSQL systems handle non-relational databases, meaning the data is not structured into tables. In such database systems, the data is not arranged in any fixed format. Unstructured data is very common now, coming from software, gadgets, etc.

There are many database systems working under the principle of the NoSQL model. Some of them are

  • Redis
  • MongoDB
  • Cassandra
  • HBase
  • Neo4j

Hadoop is the key for a data scientist to handle the huge amount of unstructured data we get from raw devices and surveys. Secondly, extensions of Hadoop like Mahout help the data scientist apply machine learning concepts and algorithms to a huge amount of data.

A common rule of thumb is that the number of centroids (K) is approximately the square root of the number of data points divided by 2, i.e. K ≈ sqrt(n / 2). Using this method we get an approximate value of K. To get a more precise value we have other methods (a small sketch of the elbow method follows this list), such as

  • elbow method
  • kernel method
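
A brief sketch of the elbow method with scikit-learn (random example data): inspect the inertia for several values of K and look for the "elbow" where it stops dropping sharply:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia drops sharply until the true K (2)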

Univariate analysis covers descriptive statistical methods that involve only one variable. An example of univariate analysis is a pie chart.

In bivariate analysis, we analyze the relationship between two variables; there are two variables involved. An example of such an analysis is a scatterplot.

In multivariate analysis, there are more than two variables involved.

Statistics is the basic analysis that helps the analyst learn about customer preferences. By using statistical methods, the analyst gets data about end-user preferences such as interests, retention, and complaints about the product, as well as their expectations. Learning about the end users helps the analyst make more reliable and robust products.

In Data Science, there are many statistical methods available and some of them are,

  • The Arithmetic mean
  • Graphic Display
  • Regression
  • Correlation
  • Time series

For handling big data, there are methods like

  • Sentiment Analysis
  • Semantic Analysis
  • A/B testing

Also, many machine learning methods are used to handle huge amounts of data.

RDBMS is the short form of "Relational Database Management System", which is a database management system that works on the principle of the relational model. The relational model was introduced by E. F. Codd, and RDBMS is used to handle huge amounts of structured data.

There are many databases that work as RDBMS, and some of them are

  • MySQL
  • IBM DB2
  • Oracle
  • Microsoft SQL Server
  • Microsoft Access

The chi-square test is a statistical method that compares the observed values with the theoretical values we expect and measures how well they agree.
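
A SciPy sketch with invented observed and expected counts:

from scipy.stats import chisquare

observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)   # a small p-value means observed and expected disagree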

The F test can be done using the formula

F = explained variance / unexplained variance.

The F test is used to check and compare two population variances.
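
A minimal sketch comparing two sample variances with SciPy (the sample values are invented); the p-value comes from the F distribution:

import numpy as np
from scipy.stats import f

a = np.array([12.0, 15.0, 14.0, 10.0, 13.0])
b = np.array([22.0, 25.0, 30.0, 18.0, 35.0])

F = np.var(a, ddof=1) / np.var(b, ddof=1)      # ratio of sample variances
p_value = 2 * min(f.cdf(F, len(a) - 1, len(b) - 1),
                  1 - f.cdf(F, len(a) - 1, len(b) - 1))
print(F, p_value)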

Association analysis is the method of finding associations or relations among the data.

Association analysis is used to get a better understanding of how the data entities are related to each other.

The mean squared error is calculated by taking the average of the squares of the errors, while the absolute error is the absolute difference between the actual value and the measured value.
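
For example, with made-up actual and predicted values:

import numpy as np

actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

errors = actual - predicted
print(np.mean(errors ** 2))      # mean squared error
print(np.mean(np.abs(errors)))   # mean absolute error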

API is the short form of Application Programming Interface. An API is a set of programs, routines, and protocols that helps to build utility programs and applications.

By using APIs, it becomes much easier to develop complicated applications and software.

Collaborative filtering is a method of creating predictions (recommendations) automatically based on the preferences of other, similar users.

Combinatorics, or discrete probability, is very helpful for data scientists when studying all types of predictive models.

Precision in data science can be defined as the number of relevant (valid) instances retrieved divided by the total number of instances retrieved.

Recall can be defined as the number of relevant instances retrieved divided by the total number of relevant instances in the collection.

Market basket analysis is a type of modeling based on the idea that people who buy a certain type of product are more likely to buy another related type of product.

The central limit theorem states that the distribution of the sample average approaches a normal distribution as the sample size increases, regardless of the distribution of the underlying population.

A Type 1 error is the rejection of a true null hypothesis (a false positive), whereas a Type 2 error is the acceptance of a false null hypothesis (a false negative).

Linear regression is one of the vital types of predictive analysis. Linear regression analysis helps to find (predict) the value of a variable using another variable.

In linear regression, the value we want to predict or find is called the dependent variable, and the variable we use to make the prediction is called the independent variable.

Group functions are used in data science to give the overall summary statistics of a dataset. There are many group functions, some of them are

  • COUNT
  • MAX
  • MIN
  • AVG
  • SUM
  • DISTINCT

The root cause can be defined as the basic or core failure of a system or process. Recovering from such issues requires a deep and systematic analysis, which is called Root Cause Analysis (RCA).

The p-value is used in a hypothesis test to measure the strength of the results. The p-value lies between 0 and 1 and indicates how strong the evidence from the hypothesis test is.

Causation indicates a cause-and-effect relationship between two events; it is used to represent a reason and its result.

Cross-validation is a method to measure how well a model performs on a different data set.

An example of cross-validation is the split into training and testing data, where the training data is used to build the model and the testing data is used to check the model.

Logistic regression is used to estimate the probability of a binary event happening using the independent variables, for example the probability that a person voted or did not vote.
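
A small scikit-learn sketch with an invented feature (age) and a binary outcome:

import numpy as np
from sklearn.linear_model import LogisticRegression

age = np.array([[18], [22], [30], [45], [55], [65]])
voted = np.array([0, 0, 0, 1, 1, 1])          # binary outcome

model = LogisticRegression().fit(age, voted)
print(model.predict_proba([[40]]))   # probability of not voted / voted at age 40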

Cluster sampling is a probability-based method where the data analyst splits the population into different groups called clusters. A sample of clusters is then taken, and the analysis is done on those sampled cluster pools.

HDFS is a file system that supports only exclusive writes, so while the first user is accessing a file, the system will reject any second user's write.

Resampling methods have different uses, and some of them are

  • Estimating the precision of sample statistics
  • Exchanging the labels on data points (for example, in permutation tests)
  • Validating models

When conducting an experiment with a fixed number of independent trials, the binomial distribution helps us find the probability of observing a given number of successes in that experiment.
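
For example, the probability of getting exactly 3 heads in 10 tosses of a fair coin (a SciPy sketch):

from scipy.stats import binom

# P(X = 3) for n = 10 trials with success probability p = 0.5
print(binom.pmf(k=3, n=10, p=0.5))   # about 0.117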

 
