Introduction to Statistics in R

In this tutorial, you will learn the basic statistical concepts used in R. Statistics lets us analyze, review, and summarize data using measures such as the mean, median, mode, variance, and standard deviation. R provides built-in functions for most of these, including mean(), median(), var(), and sd().
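
As a minimal sketch of these measures, the snippet below computes each one on a small made-up vector. The vector scores and the helper stat_mode() are illustrative assumptions, not part of this tutorial. Note that base R's mode() returns an object's storage mode rather than its most frequent value, so a small helper is used for the statistical mode.

```r
# Hypothetical sample data, for illustration only
scores <- c(4, 8, 6, 8, 10, 6, 8)

mean(scores)    # arithmetic mean
median(scores)  # middle value of the sorted data
var(scores)     # sample variance (divides by n - 1)
sd(scores)      # sample standard deviation, i.e. sqrt(var(scores))

# Base R has no built-in statistical mode. This hypothetical helper
# returns the most frequent value (the first one found if there is a tie).
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
stat_mode(scores)
```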

What is statistics in R?

Statistics is a branch of mathematics that deals with the collection, organization, analysis, interpretation, and presentation of data. It is used to work through complex real-world problems and helps analysts identify meaningful trends and changes. Statistical learning refers to a set of tools for modeling and understanding data. In short, statistics helps us collect data, analyze it, and draw conclusions from it.
A statistical study typically follows these steps:

  1. Identifying a problem
  2. Collecting relevant data (gathering data)
  3. Analyzing the data (evaluating data)
  4. Deriving a conclusion (summarizing the data)

R is a popular language for data science and statistics, and it is often described as a language for statistical computing. Professionals and data experts use R for modeling, financial data analysis, marketing trend analysis, and other kinds of analysis. Its statistical capabilities are one of the main reasons users switch to R: it has a rich collection of statistical techniques and functions, along with sophisticated graphics and visualization capabilities for plots and graphs. Some other reasons users favor R are:

  • R is an open-source statistical programming language and is freely available.
  • R is highly flexible.
  • R is a powerful scripting language.
  • R is cross-platform compatible.

Data analysis is required because we live in a data-rich world. Data is revolutionizing businesses and many other sectors, and analyzing it provides better insights. Analysts review data so that they can reach meaningful conclusions. To do this, they apply statistical functions, principles, and algorithms to analyze raw data, build statistical models, and infer or predict results.

Statistics influences most domains of everyday life, such as education, the stock market, life sciences, insurance, and retail.

There are a few statistical terms to be aware of before starting with statistics:

  1. Population: The complete set of sources from which data is to be collected. For example, a class of children or the people of a country.
  2. Sample: A subset of the population.
  3. Variable: Any characteristic, number, or quantity that can be measured or counted. It can also be called a data item. Examples for a population of people include height, weight, income, blood group, time, gender, and age.
  4. Parameter: A numerical quantity that describes a characteristic of the whole population, such as the population mean. More formally, a statistical parameter (or population parameter) is a quantity that indexes a family of probability distributions. It gives a characteristic idea of the population in general.
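
The relationship between a population and a sample can be sketched in R as follows. The simulated heights, the object names population and height_sample, and the sizes used are all assumptions made for illustration; in practice the full population is rarely available.

```r
set.seed(42)  # make the random draws reproducible

# Hypothetical population: heights (cm) of 1,000 people, simulated here
population <- rnorm(1000, mean = 165, sd = 10)

# Sample: a random subset of 50 members of the population
height_sample <- sample(population, size = 50)

# Computed from the whole population
mean(population)

# Computed from the sample only, used to estimate the value above
mean(height_sample)
```

The two means will usually be close but not identical, which is why conclusions drawn from a sample carry some uncertainty.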

Note: Statistics is a term used to summarize a process that an analyst uses to characterize a dataset.

Types of Analysis

In statistical analysis, the first step is to obtain data. An event can then be analyzed in either of two ways: quantitatively or qualitatively.

  • Quantitative analysis, also known as statistical analysis, is the science of collecting and interpreting data with numbers and graphs to identify trends. Quantitative data can be counted, measured, and expressed using numbers. It is considered structured data.

    For example, consider a numeric vector weight created using c(), such as weight = c(45.7, 30.0, 67.4, 89.3). Displaying the vector gives the following output:

    [1] 45.7 30.0 67.4 89.3

    So the vector weight stores quantitative (numeric) data.

  • Qualitative analysis, also known as non-statistical analysis, deals with qualitative data, which is unstructured or semi-structured in nature, such as text and media. Qualitative data is also known as categorical data, and in R it is stored using factors. This data is used for hypotheses and interpretations. Such data cannot be collected and analyzed using conventional methods.

    Take the example of different shirt sizes. Create an object named shirts.

    shirts = c("S","M","L","XL","XXL","S","L")

    Every element is enclosed in double quotes because shirts is a character vector.

    Let us display the elements of the vector shirts.

    [1] "S"   "M"   "L"   "XL"  "XXL" "S"   "L"

    Create another object, shirt_sizes, by calling the factor() function with shirts as its argument. This makes shirt_sizes a factor.

    shirt_sizes = factor(shirts)
    > shirt_sizes
    [1] S   M   L   XL  XXL S   L  
    Levels: L M S XL XXL

    The output shows the factor's values and its levels, which are the unique values within the factor.
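
A short sketch of how the two kinds of data are typically summarized, reusing the weight and shirts values from the examples above. The ordered factor shirt_sizes_ordered is an illustrative addition, not part of the original example.

```r
# Quantitative data: numeric summaries are meaningful
weight <- c(45.7, 30.0, 67.4, 89.3)
is.numeric(weight)
summary(weight)   # minimum, quartiles, mean, and maximum

# Qualitative data: count how often each category occurs
shirts <- c("S", "M", "L", "XL", "XXL", "S", "L")
shirt_sizes <- factor(shirts)
levels(shirt_sizes)   # unique categories, sorted alphabetically by default
table(shirt_sizes)    # frequency of each category

# Assumption for illustration: shirt sizes have a natural order, so an
# ordered factor with explicit levels may be preferable.
shirt_sizes_ordered <- factor(shirts,
                              levels = c("S", "M", "L", "XL", "XXL"),
                              ordered = TRUE)
shirt_sizes_ordered < "XL"   # element-wise comparison by size order
```

With explicit levels, the categories follow the size order rather than alphabetical order, which also makes comparisons between sizes possible.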

Consider another example that summarizes both concepts. If you order a coffee from a restaurant, it is available in small, medium, or large, which is qualitative data. But if a store sells 50 regular coffees in a week, that is quantitative data, because there is an exact count.

Other examples of qualitative data are gender (such as male or female) and the rating of a product. These are not actually measured but are categorized on the basis of their properties, attributes, or labels.