
Data Science Interview Questions

Data science can be explained as a collection of machine learning algorithms, methods, tools, etc., intended to find useful patterns in huge amounts of raw data. It has immense applicability in AI, statistics, prediction, the medical field, etc.

Differences between Data Science and Big Data

  • Data Science is a collection of methods, tools, and algorithms to manage and retrieve information from raw data; Big Data is the huge data sets we collect from various sources that cannot be stored easily.
  • Data Science is applied in speech and voice recognition, the financial sector, web research, etc.; Big Data is popular in communications, research, medical applications, etc.
  • Data Science retrieves data patterns from raw data using machine learning algorithms; Big Data helps solve data storage issues and handle huge amounts of data.
  • Important languages and tools in Data Science are Python, R, SQL, etc.; Big Data uses frameworks such as Hadoop, Spark, and Hive.


We have various methods and standards to check the quality of data; some of them are:

  • Data completeness
  • Data Consistency
  • Data Uniqueness
  • Data integrity
  • Data accuracy
  • Data conformity

As the name suggests, supervised learning needs a supervisor while the machine learns from the dataset.

In supervised learning, we provide a sample dataset, called the training data, to the machine before it works on the actual data.

Examples of supervised learning are signature recognition, speech recognition, face detection, etc.

Unsupervised learning is similar to the working of the human brain. Unlike supervised learning, unsupervised learning doesn't have training data, so the machine has to learn the patterns from the actual data itself.

In simple terms, unsupervised learning has to learn from the actual data without a supervisor (a training dataset).

Missing data is one of the major hurdles that has to be solved in data science. In general, there are two methods for dealing with missing data.

1. Debugging methods: these include the data cleaning process, which checks the data quality and takes the necessary steps to improve it. Some of the important debugging methods are:

  1. Searching the list of values
  2. Filtering questions
  3. Check for logical consistencies
  4. Check the level of representativeness

2. Imputation method: here we try to replace the missing values in the dataset by estimating valid values and answers. There are mainly three types of imputation methods:

  1. Random imputation
  2. Hot deck imputation
  3. Imputation of the mean 
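Mean imputation, the third method above, can be sketched in plain Python (real pipelines typically use pandas' fillna() or scikit-learn's SimpleImputer instead; the ages list is a made-up example):

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [25, None, 31, 40, None, 28]
print(impute_mean(ages))  # [25, 31.0, 31, 40, 31.0, 28]
```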

Hadoop is not a programming language; it is an open-source processing framework that helps manage the processing and storage of massive amounts of data in big data applications running on pooled (clustered) systems.

Apache Hadoop is a collection of open-source software and utility programs that use multiple computers in a network to solve complicated problems that need massive amounts of data and processing.

Apache Hadoop also provides a high-end framework for distributed storage and processing using the programming model called MapReduce.

Hadoop is sometimes expanded as High Availability Distributed Object Oriented Platform.

This is a common, slightly subjective, and confusing question that can be asked in an interview. Most big companies maintain that good data is more important, as we can't build a good model without enough good data.

The answer to this question depends on your personal experience and depends on the specific case if they provide an example or case.


fsck is an important command in the Hadoop system. It is called the File System Check command and helps us check for errors in the file system. It also generates a summary report on the state of the Hadoop Distributed File System.

A wide data format is a way of writing data where each row is unique and has many columns for different attributes. In wide format, if we have an entity with many attributes, every attribute is written in a different column of a single row (one row per entity), so there are a large number of columns per row. Categorical data can be grouped here.

A long data format has only a limited number of columns for each row. In this model the rows are not unique: the same entity is repeated across rows, one row per attribute of that entity.

(Figure: wide and long data format)
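Assuming pandas, a wide table can be reshaped to long format with melt(); the table contents are made up for illustration:

```python
import pandas as pd

# Wide format: one row per entity, one column per attribute.
wide = pd.DataFrame({
    "name": ["Ann", "Ben"],
    "height": [170, 180],
    "weight": [60, 80],
})

# Long format: one row per (entity, attribute) pair.
long = pd.melt(wide, id_vars="name", var_name="attribute", value_name="value")
print(long.shape)  # (4, 3): 2 entities x 2 attributes, 3 columns
```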


Interpolation is a method of finding data points that are not given but lie between the known data points.

It can be defined as predicting a data value between other data points, i.e., calculating a function or data value based on the other values in a series.

Unlike interpolation, in extrapolation we have to find missing data points that lie beyond the given data set.

It is like predicting data values outside the range of the data. The quality of an extrapolated value depends on the method we choose for the prediction.
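Both ideas can be sketched with a single straight-line formula through two known points (plain Python, invented numbers):

```python
def linear_interp(x, x0, y0, x1, y1):
    # Line through (x0, y0) and (x1, y1): inside [x0, x1] this
    # interpolates; outside that range it extrapolates.
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

print(linear_interp(2.5, 2, 20, 3, 30))  # interpolation -> 25.0
print(linear_interp(4.0, 2, 20, 3, 30))  # extrapolation -> 40.0
```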

In data science, the quantity and quality of data needed for a good output depend on different factors, such as

  • The method we use to calculate the output.
  • How much prediction accuracy we need, among other factors.

The expected value is the average outcome we would get after a huge number of trials. It is a theoretical, long-run average value.
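A quick sketch: the expected value of a fair six-sided die roll, computed as the sum of value times probability.

```python
def expected_value(outcomes):
    """outcomes: list of (value, probability) pairs."""
    return sum(value * prob for value, prob in outcomes)

# Fair six-sided die: each face has probability 1/6.
die = [(face, 1 / 6) for face in range(1, 7)]
print(expected_value(die))  # 3.5 (up to float rounding)
```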

HDFS supports only exclusive writes: the file system accepts input only from the first user who accesses the file, even if the difference is microseconds. The second user's writes are rejected.

Power analysis is a calculation that helps you find or decide the minimum sample size needed for your research or study in data science, given a significance level, expected effect size, etc.

The normal distribution is also called the Gaussian distribution. It can be defined as a probability distribution that is symmetric about the mean. It shows that data near the mean occur more frequently than data far from the mean.
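Python's standard library can illustrate this symmetry about the mean:

```python
from statistics import NormalDist

nd = NormalDist(mu=0, sigma=1)  # standard normal distribution
print(nd.cdf(0))  # 0.5: half the probability mass lies below the mean
# Symmetry: the mass below -1 plus the mass below +1 sums to 1.
print(abs(nd.cdf(-1) + nd.cdf(1) - 1) < 1e-9)  # True
```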

Linear regression calculates a variable's value using the other values in the dataset. It can be defined as a linear statistical method for finding the relationship between two variables in a dataset.

The value we want to calculate or predict is called the dependent variable, and the values we use for the prediction are called independent variables.

Linear regression uses a straight line to represent the relation between the variables.
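The straight-line fit can be sketched with the closed-form least-squares formulas for slope and intercept (plain Python, invented data):

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]        # exactly y = 2x + 1
print(fit_line(xs, ys))  # (2.0, 1.0)
```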


For checking the elements, first create the two lists (or Series), then use the isin() function to check which elements of list one are present in list two.
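Assuming pandas (where isin() lives), a minimal sketch with made-up values:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = [2, 4, 6]

# isin() returns a boolean mask marking which elements of s1 occur in s2.
mask = s1.isin(s2)
print(s1[mask].tolist())  # [2, 4]
```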

The major difference between KNN and K-means clustering is that

KNN is a supervised learning algorithm: it has a training dataset used to teach the algorithm to find patterns in the data, so the whole process is done under supervision. Whereas

K-means clustering is an unsupervised learning algorithm: it has no training dataset, so the algorithm has to find patterns from the raw data itself, similar to the working of the human brain.
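The supervised flavour of KNN can be illustrated with a minimal 1-nearest-neighbour sketch on labelled 1-D points (a toy example with made-up data, not a full KNN implementation):

```python
def nearest_label(train, query):
    # train: list of (value, label) pairs — the labelled training data
    # that makes this a supervised method.
    return min(train, key=lambda pair: abs(pair[0] - query))[1]

train = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]
print(nearest_label(train, 1.5))  # small
print(nearest_label(train, 8.5))  # large
```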


By using the pandas function concat() we can stack two sequences (Series). If we need to merge them horizontally, set the axis to 1. For example, suppose we have two Series s1 and s2; then

result = pd.concat([s1, s2], axis=1)

In pandas, the function to_datetime() is used to convert date strings into a time series.
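Assuming pandas, a short sketch with made-up dates:

```python
import pandas as pd

# to_datetime() parses date strings into Timestamps, giving a
# datetime64 Series suitable for time-series work.
dates = pd.Series(["2021-01-01", "2021-02-15", "2021-03-30"])
parsed = pd.to_datetime(dates)
print(parsed.dt.year.tolist())  # [2021, 2021, 2021]
```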

Python and R are programming languages that can be used in data science. Both have a wide range of functions and libraries that work well with data. Some differences between the two languages are:

Python and R differences

  • Python has broad application-level uses such as web development and data analysis; R is mainly used for statistical modeling.
  • Python is mainly used by data scientists, programmers, and data engineers; R is mainly used by statisticians, data engineers, and data scientists.
  • Python is simple and can be used by anyone from beginners to expert engineers; R can be used even by people with no programming or coding background.
  • Python packages are distributed via PyPI; R packages are distributed via CRAN.
  • Python has visualization tools like Matplotlib, Bokeh, and Seaborn; R uses visualization packages like ggplot2, plotly, and ggiraph.


An ROC curve (Receiver Operating Characteristic curve) is a graph that shows the performance of a classification model across threshold values. This graph has two parameters:

  1. True positive rate (TPR)
  2. False positive rate (FPR)

TPR can be calculated as TP / (TP + FN) and FPR as FP / (FP + TN), where

  • TP = True positive
  • TN = True negative
  • FP = false positive
  • FN = False negative

AUC stands for Area Under the ROC Curve. It summarizes performance against all classification threshold values as a single measure of this two-dimensional curve. A closely related curve plots precision against recall, which can be calculated as

P = TP / (TP + FP) and R = TP / (TP + FN), where

  • TP = True positive
  • TN = True negative
  • FP = false positive
  • FN = False negative
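The rates above can be computed directly from the four counts; the counts themselves are made-up example numbers.

```python
def rates(tp, fp, fn, tn):
    tpr = tp / (tp + fn)         # true positive rate (recall)
    fpr = fp / (fp + tn)         # false positive rate
    precision = tp / (tp + fp)
    return tpr, fpr, precision

print(rates(tp=80, fp=10, fn=20, tn=90))  # (0.8, 0.1, ~0.889)
```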

For creating a Series from a list of values, we can use the pandas function Series().

Bias is the amount by which a model's predicted value differs from the actual value. Bias happens due to over-simplification of the model. High bias leads to a phenomenon called underfitting.

Underfitting is a phenomenon caused by high bias, where the model gives poor results for both the test data and the training data.

Variance is the change that happens in the results when we use different training data sets. Variance is caused by over-complex prediction models.

Overfitting is the opposite problem, where the model gives correct output for the training data but poor results when loaded with test data.

The confusion matrix is a table used to check the performance of a supervised classification algorithm. Using a confusion matrix we can see the errors made by the prediction model and also the types of those errors. A confusion matrix is also called an error matrix.

Selection bias is a kind of bias error that happens when we select data for research. It occurs when the supposedly random research data is not truly representative, and it can affect the prediction outcome. It is also called the selection effect, and it comes in different types, such as

  • Sampling bias
  • Time interval
  • Data
  • Attrition


A Markov chain is a systematic way of creating random sample values where the probability of the next value depends only on the current value of the series. The Markov model was created by Andrey Markov. Data scientists use Markov chain models to predict outputs in certain cases.
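A toy two-state Markov chain can be sketched with the standard library; the transition probabilities are made-up illustration values:

```python
import random

# Next-state probabilities depend only on the current state.
transitions = {
    "sunny": [("sunny", 0.8), ("rainy", 0.2)],
    "rainy": [("sunny", 0.4), ("rainy", 0.6)],
}

def next_state(state):
    states, probs = zip(*transitions[state])
    return random.choices(states, weights=probs)[0]

random.seed(0)
chain = ["sunny"]
for _ in range(5):
    chain.append(next_state(chain[-1]))
print(chain)  # a 6-step sampled weather sequence
```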

True positive rate (TPR) is the probability that a positive value will be tested as positive. TPR is the ratio of true positives to the sum of true positives and false negatives:

TPR = TP / (TP + FN)

False positive rate (FPR) is the probability of a false trigger, i.e., the probability of a result showing as positive when it is actually negative. FPR is the ratio of false positives to the sum of false positives and true negatives:

FPR = FP / (FP + TN)
The R programming language offers a huge number of built-in functions and libraries for visualizing data, such as ggplot2, leaflet, and lattice. Using R we can develop almost any kind of graph, which helps in exploratory data analysis. R supports more graphical requirements than most other languages.

SVM stands for Support Vector Machines, a supervised machine learning algorithm used for classification and regression problems. It is very popular because of its high accuracy and low computational cost. SVM finds a plane, called a hyperplane, that separates the classes of variables.

Some of the kernels used in SVM are 

  • Polynomial Kernel
  • Gaussian 
  • Laplace RBF 
  • Sigmoid 
  • Hyperbolic Kernel

Deep learning is a branch of machine learning that creates algorithms working in a way similar to the human nervous system and to how the human brain gains knowledge from different situations. Deep learning involves neural networks that work on huge amounts of data to find patterns. Practical applications of deep learning are face recognition, virtual assistants, self-driving cars, etc.

A/B testing is an optimization method for finding out how a change to some variable will affect the audience and how users react to that change.

A/B testing is usually done on web pages: a proposed change to a page is shown only to a subset of users to check their response. Based on that response, the change may then be applied permanently for all users and all pages of the website.
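A toy comparison of the conversion rates of two variants (the visit and conversion counts are invented for illustration; a real test would also check statistical significance):

```python
def conversion_rate(conversions, visitors):
    return conversions / visitors

rate_a = conversion_rate(30, 1000)  # control page: 3.0% converted
rate_b = conversion_rate(45, 1000)  # changed page: 4.5% converted
lift = (rate_b - rate_a) / rate_a   # relative improvement of B over A

print(rate_b > rate_a)  # True: the change performed better in this sample
print(round(lift, 2))   # 0.5 -> a 50% relative lift
```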

The Box-Cox transformation is a method used to transform a non-normal dependent variable into a normal shape. With this transformation we can change a response variable so that the data meets specific requirements. For λ ≠ 0 the transform is y(λ) = (x^λ − 1) / λ, and for λ = 0 it is y = ln(x).

Suppose we have a huge number of dimensions (features) collected for a prediction; the difficulty of choosing the relevant dimensions out of so many unwanted ones is called the curse of dimensionality.

In simple words, when a dataset has a huge number of columns that are not required, extracting the right columns takes huge effort.

The pickle module in Python is used to serialize or deserialize an object. It can convert Python objects such as lists and dicts into byte streams; this is also called marshalling or flattening.

We can also convert these byte streams back into Python objects, which is referred to as unpickling. Pickling helps to store an object on disk.
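A minimal round trip with the standard pickle module (the dictionary contents are made up):

```python
import pickle

data = {"name": "Ann", "scores": [88, 92, 79]}
blob = pickle.dumps(data)      # pickling: object -> bytes
restored = pickle.loads(blob)  # unpickling: bytes -> object
print(restored == data)        # True: the round trip preserves the object
```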

There are several joins that can be used on tables, which are

  1. Inner Join
  2. Left Join
  3. Outer Join
  4. Full Join
  5. Self Join
  6. Cartesian Join
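As a sketch, SQL-style joins can be mirrored with pandas' merge(); the tables and column names below are made-up examples.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cara"]})
right = pd.DataFrame({"id": [2, 3, 4], "city": ["Oslo", "Rome", "Lima"]})

inner = pd.merge(left, right, on="id", how="inner")  # only matching ids
outer = pd.merge(left, right, on="id", how="outer")  # all ids from both sides

print(inner["id"].tolist())  # [2, 3]
print(len(outer))            # 4 rows: ids 1, 2, 3, 4
```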

The DELETE command is used to delete a number of rows from a table; it is used with a WHERE clause to select which rows get deleted. The TRUNCATE command, on the other hand, deletes all rows from a table. A DELETE can be rolled back, but a TRUNCATE cannot.

Some of the clauses that we can use with SQL are

  • WHERE
  • USING

A foreign key is a key in a DBMS table that helps to make a link between two tables. It can be defined as a key that belongs to one table while acting as the primary key of another table.

The table where the foreign key resides is called the child table and the table where the foreign key is the primary key is called the parent table.

Data integrity can be defined as the process or concept that helps to ensure the consistency, accuracy, and reliability of data that is to be stored in a database. Data integrity ensures the data quality and helps to make good predictions using that data. 

SQL database systems are used to handle relational databases (RDBMS). In RDBMS systems the data is structured, which means it is organized in table format (rows and columns).

NoSQL systems handle non-relational databases, meaning the data is not structured or arranged in any fixed format. Unstructured data is very common now, coming from software, gadgets, etc.

There are many database systems working under the principle of NoSQL database system. some of them are,

  • Redis
  • MongoDB
  • Cassandra
  • HBase
  • Neo4j

Hadoop is key for a data scientist handling the huge amounts of unstructured data we get from raw devices and surveys. Secondly, Hadoop extensions like Mahout help the data scientist apply machine learning concepts and algorithms to huge amounts of data.

In general, the number of centroids (K) can be estimated as the square root of the number of data points divided by 2. This gives an approximate value of K. For a more precise value we have methods like

  • elbow method
  • kernel method
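The square-root rule of thumb above can be sketched as:

```python
import math

def initial_k(n_points):
    """Rule-of-thumb starting K for k-means: sqrt(n / 2)."""
    return round(math.sqrt(n_points / 2))

print(initial_k(200))  # sqrt(200 / 2) = sqrt(100) -> 10
```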

Univariate analysis comprises descriptive statistical methods that involve only one variable. An example of univariate analysis is a pie chart.

In bivariate analysis, two variables are involved, and we analyze the relationship between them. An example of such analysis is a scatterplot.

In Multivariate analysis, there will be more than 2 variables involved.

Statistics is the basic analysis that helps the analyst learn about customer preferences. By using statistical methods, the analyst gets data about end-user preferences such as interests, retention, and complaints about the product, as well as their expectations. Learning about the end users helps the analyst make more reliable and robust products.

In Data Science, there are many statistical methods available and some of them are,

  • The Arithmetic mean
  • Graphic Display
  • Regression
  • Correlation
  • Time series

For handling big data, there are methods like

  • Sentiment Analysis
  • Semantic Analysis
  • A/B testing

Also, many machine learning methods are used to handle the huge data.

RDBMS is short for "Relational Database Management System", a database management system that works on the principle of the relational model. The relational model was introduced by E. F. Codd, and RDBMS is used to handle huge amounts of structured data.

There are many databases that work under RDBMS, some of which are

  • MySQL
  • IBM DB2
  • Microsoft SQL Server
  • Microsoft Access

The chi-square test is a statistical method that compares and measures how well the observed values agree with the theoretical values we expect (goodness of fit).
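The chi-square statistic behind this comparison can be computed directly; the observed and expected counts below are made-up examples:

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# e.g. 100 coin flips: observed 48 heads / 52 tails vs expected 50 / 50.
print(chi_square([48, 52], [50, 50]))  # (-2)^2/50 + 2^2/50 ~= 0.16
```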

The F test is used to check and compare two population variances. It can be done using the formula

F = Explained variance / Unexplained variance.

Association analysis is the method of finding associations or relations among data items.

Association analysis is used to gain a better understanding of how the data entities are related to each other.

The (mean) squared error is calculated by taking the average of the squares of the errors, while the absolute error is the difference between the actual value and the measured value.
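Both error measures can be sketched in a few lines; the actual/predicted values are invented examples:

```python
def mse(actual, predicted):
    """Mean squared error: average of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [3, 5, 2]
predicted = [2, 5, 4]
print(mse(actual, predicted))  # (1 + 0 + 4) / 3 ~= 1.667
print(mae(actual, predicted))  # (1 + 0 + 2) / 3 = 1.0
```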

API is short for Application Programming Interface. An API is a set of programs, routines, and protocols that helps to build utility programs and applications.

By using APIs, it becomes much easier to develop complicated applications and software.

Collaborative filtering is a method of creating predictions automatically with the help of the preferences and recommendations of other users.

Combinatorics or discrete probability is very helpful for data scientists to study all types of predictive models.

Precision in data science can be defined as the number of relevant (valid) instances retrieved divided by the total number of instances retrieved.

Recall can be defined as the number of relevant instances retrieved divided by the total number of relevant instances.

Market basket analysis is a type of modeling based on the idea that people who buy a certain type of product have a higher probability of buying another particular type of product.

The central limit theorem states that, as the sample size increases, the distribution of the sample average approaches the normal distribution, regardless of the underlying distribution from which the samples are taken.

A Type 1 error is the rejection of a true null hypothesis (a false positive), whereas a Type 2 error is the acceptance of a false null hypothesis (a false negative).

Linear regression is one of the vital types of predictive analysis methods. Linear regression analysis helps to find (predict) the value of one variable using another variable.

In linear regression, the value we want to predict is called the dependent variable, and the variable we use to predict it is called the independent variable.

Group functions are used in data science to give overall summary statistics of a dataset. There are many group functions; some of them are

  • MAX
  • MIN
  • AVG
  • SUM

The root cause can be defined as the basic or core failure of a system or process. Recovering from such issues requires a deep, systematic analysis, which is called Root Cause Analysis (RCA).

The P value is used in a hypothesis test to measure the strength of the results. The value of P lies between 0 and 1 and determines the strength of the evidence in the hypothesis test.

Causation indicates a cause-and-effect relationship between two events: it represents a reason and its result.

Cross-validation is a method to measure the performance of a model on data it was not built from.

An example of cross-validation is the split into training and testing data, where the training data is used to build the model and the testing data is used to check it.
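The train/test rotation behind k-fold cross-validation can be sketched in plain Python (a simplified index split, not a full library implementation):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; each fold serves once as test."""
    folds = []
    fold_size = n // k
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n
        test = list(range(start, stop))
        train = [j for j in range(n) if j not in test]
        folds.append((train, test))
    return folds

for train, test in k_fold_indices(6, 3):
    print(train, test)  # each of the 3 folds is the test set exactly once
```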

Logistic regression is used to estimate the probability of a binary event happening, using the independent variables; for example, the chance that a person voted or did not vote.
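The sigmoid mapping at the heart of logistic regression, turning a linear score into a probability (the weight and bias are made-up example coefficients):

```python
import math

def predict_proba(x, weight, bias):
    """Map the linear score weight*x + bias to a probability in (0, 1)."""
    score = weight * x + bias
    return 1 / (1 + math.exp(-score))

print(predict_proba(0.0, weight=2.0, bias=0.0))  # sigmoid(0) = 0.5
```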

Cluster sampling is a probability-based method where the data analyst splits the population into different groups called clusters. A sample is then taken from each cluster, and the analysis is done on those cluster sample pools.

HDFS is a file system that supports only exclusive writes, so while the first user is accessing a file, the system will reject any second user's writes.

Resampling methods have different uses; some of them are

  • Estimate the sample statistics precision
  • Exchange the labels on data points
  • Validating models

When conducting an experiment on a variable, the binomial distribution helps us find the probability of a given number of successes in that experiment.
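The binomial probability mass function can be computed with the standard library:

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# e.g. probability of exactly 2 heads in 4 fair coin flips:
print(binomial_pmf(2, 4, 0.5))  # C(4,2) * 0.5^4 = 6/16 = 0.375
```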

