Data frames in R

Data frames are one of the most important and widely used data structures in R, and a common reason for starting to learn the language. R is a statistical programming language that works with datasets. A dataset is made up of observations (instances), and each observation has a set of variables associated with it.

Consider a dataset of 6 students. Each student is an instance, and the properties of the students, such as roll number, name, grade, and the section of the class they belong to, are the variables.

Now the question is: how do we store such information?

Can we use a matrix? No, because a matrix can hold elements of only one data type.

Here the roll number is numeric, while the name and the grade are characters. Such mixed data does not fit into a matrix, but it does fit into a list. A list can hold this information as components such as Roll_No, Student_Name, and so on. However, a list is not convenient to work with as a table. Another limitation is that retrieving the values you need, such as one student's complete record, takes several lines of R code.
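The difference between a matrix and a list can be checked directly. The following is a minimal sketch using a hypothetical three-student sample (the variable names are chosen only for illustration): combining a numeric and a character vector into a matrix coerces everything to character, while a list keeps each component's own type.

```r
# Hypothetical three-student sample, used only for illustration
roll_no <- c(1, 2, 3)
student_name <- c("ALEX", "BOB", "CARLES")

# A matrix holds a single type, so the numbers are coerced to character
m <- cbind(roll_no, student_name)
matrix_type <- typeof(m)

# A list keeps the type of each component
l <- list(Roll_No = roll_no, Student_Name = student_name)
list_types <- sapply(l, typeof)

print(matrix_type)
print(list_types)
```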

Which data structure overcomes these limitations? This is where the data frame, an important and widely used concept in the R programming language, comes in.

What is a data frame in R?

In the R programming language, a data frame is a fundamental data structure for storing tabular data. A data frame is like a spreadsheet with rows and columns: the rows correspond to observations and the columns correspond to variables. Each column is a vector, and different columns can have different types.

The word frame represents a structure or a shape or a design or a pattern. So the meaning of the word data frame can be inferred as a structure/shape/design/pattern that outlines information regarding data or simply data organized in a well-defined frame.

Consider a spreadsheet or a table with data on a few school students: their roll number, name, and the grade they scored in their examination. Each column specifies one property of the students.

Here, Rows = observations (students), Columns = variables (roll number, name, grade)

Student Details

The same table can be represented in R with the data structure known as a “data frame”. Each column can be a vector of a different type, such as numeric or character. In the given student_table, the first column (c1) is a NUMERIC vector, while the second column (c2) and the third column (c3) are CHARACTER vectors.

Data frames are closely related to the list data structure. A list consists of heterogeneous components, which may differ in both type and length. A data frame is a special case of a list in which every component is a vector of the same length, and the data are displayed in a frame.

A data frame also resembles a matrix because it has rows and columns. The big difference is that a data frame can contain elements of different types: one column can be character, another numeric or logical, depending on the requirements.

Still, a data frame has one restriction: all elements within a single column must be of the same data type.

For example, in the table above, every element of the Roll_No column is of numeric data type.
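The one-type-per-column rule can be inspected with sapply() and class(). The sketch below assumes a shortened three-row version of the student table; stringsAsFactors = FALSE is passed explicitly so that character columns stay character on R versions before 4.0 as well.

```r
# Shortened three-row version of the student table (illustrative data)
D <- data.frame(
  Roll_No = c(1, 2, 3),
  Student_Name = c("ALEX", "BOB", "CARLES"),
  Grade = c("B", "A", "O"),
  stringsAsFactors = FALSE
)

# Each column has exactly one class
col_classes <- sapply(D, class)
print(col_classes)

# Mixing types in one vector forces a single common type
mixed <- c(1, "A")
mixed_class <- class(mixed)
print(mixed_class)
```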

Let us visualize the difference between a list and a data frame to better understand the concept of the data frame in R.

Student Details

The same data as in the list is visualized in a tabular form in the data frame.

Now let us move on to the practical part of R programming by learning to create a data frame.

How to create a data frame in R?

In R, you often do not need to create a data frame by hand; instead, you can import data from another source. The source can be a CSV file, a relational database (e.g. one queried with SQL), or another software package such as Excel or SPSS.
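As a rough illustration of importing, the sketch below writes a tiny CSV to a temporary file and reads it back with the base function read.csv(). The file contents and variable names are made up for the example; in practice you would pass the path of your own file.

```r
# Write a small, made-up CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Roll_No,Student_Name,Grade",
    "1,ALEX,B",
    "2,BOB,A"),
  csv_path
)

# read.csv() returns a data frame; the first line is used as column names
imported <- read.csv(csv_path, stringsAsFactors = FALSE)
print(imported)

# Remove the temporary file
unlink(csv_path)
```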

R also provides a way to create data frames manually, using the built-in function data.frame(). To create a data frame with 6 observations (rows) and 3 variables (columns), you pass data.frame() three vectors, each of length six.

To create a data frame in R

data.frame()

The parentheses () enclose the vectors, of any data type, that are passed as arguments. Vectors such as vector1, vector2, etc. are created outside the data.frame() function, and their names are then passed inside the parentheses.


<variable_name> = data.frame(<vector1>, <vector2>, <…………>)

Let us learn how to create a data frame for the above mentioned student_table.

The table consists of three columns (c1, c2, c3). Column c1 holds the Roll_No of each student, so create a vector for Roll_No. As we know, a vector is created using the c() function.

Column c2 is Student_Name, of character data type (ALEX, BOB, etc.), and c3 is Grade, also of character data type (B, A, and so on). Each column represents a vector (variable) and each row represents an observation, so all the values in one row describe the same student.

Let us create three vectors using the c() function


Roll_No = c(1,2,3,4,5,6)
Student_Name = c("ALEX","BOB","CARLES","DANIEL","FRANKO","HENRY")
Grade =  c("B","A","O","A","C","A")

Then pass the three vectors to data.frame() and assign the result to a variable:

 


D = data.frame(Roll_No, Student_Name,Grade )

The data frame, which contains the vectors Roll_No, Student_Name, and Grade, is created and assigned to the variable D.

The steps for creating a data frame are

  1. Create the variables or vectors you need.
  2. Create a data frame using data.frame() and assign it to a variable.
  3. Print the assigned variable to display the data frame.

The full source code for creating the data frame for student_table is given below.

Program to create a data frame


#create 3 vectors Roll_No, Student_Name, Grade using c()
Roll_No = c(1,2,3,4,5,6) #numeric type 
Student_Name = c("ALEX","BOB","CARLES","DANIEL","FRANKO","HENRY") #character type
Grade = c("B","A","O","A","C","A") #character type

#Created a data frame and assigned to Variable D
D = data.frame(Roll_No,Student_Name,Grade )

print(D)

Output:


  Roll_No Student_Name Grade
1       1         ALEX     B
2       2          BOB     A
3       3       CARLES     O
4       4       DANIEL     A
5       5       FRANKO     C
6       6        HENRY     A


Note: The three vectors Roll_No, Student_Name, and Grade must all be of the same length.
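To see what happens when the lengths do not match, here is a small sketch. Grade_short is a hypothetical vector with only five values; tryCatch() is used so the error message is captured instead of stopping the script.

```r
Roll_No <- c(1, 2, 3, 4, 5, 6)
Grade_short <- c("B", "A", "O", "A", "C")   # hypothetical: one value missing

# data.frame() refuses to combine vectors of length 6 and 5
result <- tryCatch(
  data.frame(Roll_No, Grade_short),
  error = function(e) conditionMessage(e)
)

# result holds the error message rather than a data frame
print(result)
```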

How to calculate the rows and columns in a data frame in R?

The number of rows and columns in a data frame can be calculated using the following syntax

The syntax for calculating rows


nrow()

The parentheses contain the name of the variable to which the data frame is assigned. In our previous example, we stored the data frame for student_table in the variable D. Let us check the number of rows in student_table:


> nrow(D)
[1] 6

nrow(D) returns 6, which means there are 6 rows in the table.

The syntax for calculating columns


ncol()

To calculate the number of columns in a data frame, use ncol(). The parentheses contain the name of the data frame whose columns are to be counted.

Let us check the number of columns in our student_table,


> ncol(D)
[1] 3

The function ncol() returns 3, which means there are 3 columns in our data frame D.

Both the row and column numbers can be calculated using a single function.

The syntax for calculating both row and column numbers


dim()

Let us check the output of the number of rows and columns for student_table,


> dim(D)
[1] 6 3

The function returns the number of rows (6) followed by the number of columns (3).
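The three functions can be compared side by side. This sketch rebuilds the student data frame so that it runs on its own; the variable names n_rows, n_cols, and shape are chosen only for illustration.

```r
D <- data.frame(
  Roll_No = c(1, 2, 3, 4, 5, 6),
  Student_Name = c("ALEX", "BOB", "CARLES", "DANIEL", "FRANKO", "HENRY"),
  Grade = c("B", "A", "O", "A", "C", "A")
)

n_rows <- nrow(D)   # number of observations
n_cols <- ncol(D)   # number of variables
shape <- dim(D)     # rows first, then columns

print(n_rows)
print(n_cols)
print(shape)
```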

How to extract the name of a column of the R data frame?

The R programming language has a built-in function, names(), that supports getting and setting the names of a data frame's components, such as the name of a column. names() is a generic function for accessing the names attribute of R objects.

In our vector and matrix tutorials, we discussed the names() function and used it to assign names to vector and matrix data structures.

Here, names() is used to get the name of a column by its position. Pass the name of the data frame to names(), then give the column number inside square brackets [ ].

The Syntax to extract the name of a column in the R data frame


names(<name of data frame>)[<column number>]

Earlier, we created a data frame D from three vectors. These vectors form the columns of the data frame, and their values fill the rows.

For student_table, names(D) returns the column names of the data frame D, and the index [1] selects the first of them. The expression therefore returns the name of the first column, Roll_No.

Example


> names(D)[1]
[1] "Roll_No"

The name retrieved for the first column is Roll_No.
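Since names() both gets and sets names, a short sketch of each direction may help. The replacement name Exam_Grade is hypothetical and only used to show the assignment form.

```r
D <- data.frame(
  Roll_No = c(1, 2, 3, 4, 5, 6),
  Student_Name = c("ALEX", "BOB", "CARLES", "DANIEL", "FRANKO", "HENRY"),
  Grade = c("B", "A", "O", "A", "C", "A")
)

# Getting: without an index, names() returns every column name
all_names <- names(D)
print(all_names)

# Setting: assign to one position to rename that column
names(D)[3] <- "Exam_Grade"   # hypothetical new name
print(names(D))
```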

What is the head and tail function in the R data frame?

head() and tail() are built-in functions in R that let you view just the beginning or the end of a dataset. When a large dataset is imported into your R program, it may contain many rows and columns. In our student_table example, there are only 3 columns and 6 rows, which are easy to inspect in full.

That is not the case with a huge dataset, so you can view the first few rows (the very top) using head() and the last few rows (the very bottom) using tail(). By default, each function shows six rows.

Syntax to view the top of the dataset


head(<name of data frame>)

Syntax to view the bottom of the dataset


tail(<name of data frame>)
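Both functions also accept a second argument giving the number of rows to return. A minimal sketch on the student data frame, with the result names top3 and bottom2 chosen for illustration:

```r
D <- data.frame(
  Roll_No = c(1, 2, 3, 4, 5, 6),
  Student_Name = c("ALEX", "BOB", "CARLES", "DANIEL", "FRANKO", "HENRY"),
  Grade = c("B", "A", "O", "A", "C", "A")
)

top3 <- head(D, 3)      # first three rows
bottom2 <- tail(D, 2)   # last two rows, original row names kept

print(top3)
print(bottom2)
```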

How to get the structure of a data frame in R?

The structure of a data frame can be found by using a function known as str().

Syntax to get the structure of the R data frame


str(<name of data frame>)

Let us check the structure of our student_table, the D data frame created above, using the str() function.

To get the structure of the data frame


str(D)

Calling str() on the data frame D returns the following information as output.

This is the structure of a data frame with 6 observations of 3 variables. The three variables are Roll_No, Student_Name, and Grade. It also shows that the variable Roll_No is of numeric data type (num) with the observations 1, 2, 3, 4, 5, 6.

The variable Student_Name is of character data type (chr) with the observations "ALEX" "BOB" "CARLES" "DANIEL" ...

The last one, Grade, is a vector of character type (chr) with the observations "B" "A" "O" "A" ...

Output:

'data.frame': 6 obs. of  3 variables:
 $ Roll_No     : num  1 2 3 4 5 6
 $ Student_Name: chr  "ALEX" "BOB" "CARLES" "DANIEL" ...
 $ Grade       : chr  "B" "A" "O" "A" ...

How to find a summary of an R data frame?

The summary of an R data frame is found using a built-in function in R known as the summary() function.

Syntax to find a summary of the R data frame


summary(<name of data frame>)

The function name summary is followed by the name of the data frame inside the parentheses (). Recall the data frame D we created using the data.frame() function:


#create 3 vectors Roll_No, Student_Name, Grade  using c()
Roll_No = c(1,2,3,4,5,6)            #numeric type                        
Student_Name = c("ALEX","BOB","CARLES","DANIEL","FRANKO","HENRY")  #character type
Grade =  c("B","A","O","A","C","A") #character type

#Created a data frame and assigned to Variable D
D = data.frame(Roll_No,Student_Name,Grade )
print(D)

Printing D displays the data frame in table format.

Now let us check the summary of the same data frame D using the summary() function provided by R.


summary(D)

Here summary is the function name and D is the name of the data frame we created. Let us see the output produced by the above call.

Output:


> summary(D)
    Roll_No     Student_Name          Grade          
 Min.   :1.00   Length:6           Length:6          
 1st Qu.:2.25   Class :character   Class :character  
 Median :3.50   Mode  :character   Mode  :character  
 Mean   :3.50                                        
 3rd Qu.:4.75                                        
 Max.   :6.00                                        


The summary shows that

  • The vector Roll_No has a minimum (Min.) of 1 and a maximum (Max.) of 6.
  • The first quartile (1st Qu.) is 2.25.
  • The median of Roll_No is 3.50.
  • The mean is 3.50.
  • The third quartile (3rd Qu.) is 4.75.

Similarly, the summary for the vectors Student_Name and Grade shows that

  • Each has a length of 6.
  • Their class is character.
  • Their mode is also character.
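The figures reported for Roll_No can be reproduced with the base functions quantile(), mean(), and range(). This is a sketch for checking the arithmetic, relying on quantile()'s default method; the variable names are illustrative.

```r
Roll_No <- c(1, 2, 3, 4, 5, 6)

# Quartiles with quantile()'s default method, as summary() reports them
quartiles <- quantile(Roll_No, probs = c(0.25, 0.5, 0.75))
avg <- mean(Roll_No)
min_max <- range(Roll_No)

print(quartiles)
print(avg)
print(min_max)
```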

How to add a column in an R data frame?

To add a column in the R data frame the syntax given below is followed.

Syntax to add column in R data frame


<name of existing data frame>$<name of new vector> = c(<value1>, <value2>, ………)

Let us apply the syntax to the D data frame we discussed in the sections above.


D$Section <- c("A","B","A","B","A","B")
# Or create the vector first and then attach it
Section <- c("A","B","A","B","A","B")
D$Section <- Section

A new vector, Section, is added to the data frame D. Section is a vector with the values “A” and “B”, created using c().

Let us see the output after adding the new column Section to our data frame D.

Output:


  Roll_No Student_Name Grade Section
1       1         ALEX     B       A
2       2          BOB     A       B
3       3       CARLES     O       A
4       4       DANIEL     A       B
5       5       FRANKO     C       A
6       6        HENRY     A       B

The old data frame D had only 3 columns. The new data frame now contains 4 columns: Roll_No, Student_Name, Grade, and Section.

To check the number of columns, use the ncol() function, which we learned earlier in this tutorial.

Let us see the number of columns in the new D data frame


> ncol(D)
[1] 4

How to add a column using cbind() in R?

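The cbind() function can also be used to add a column (refer to the matrix tutorial to review cbind()). Note that cbind() returns a new data frame rather than modifying D, so assign the result to a variable if you want to keep it.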


cbind(D,Section)

The output of cbind() is the data frame with the column named Section added.
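A small sketch of cbind() on the original three-column data frame follows. The name D2 is hypothetical and is used to show that the result has to be stored, because D itself is left unchanged.

```r
D <- data.frame(
  Roll_No = c(1, 2, 3, 4, 5, 6),
  Student_Name = c("ALEX", "BOB", "CARLES", "DANIEL", "FRANKO", "HENRY"),
  Grade = c("B", "A", "O", "A", "C", "A")
)
Section <- c("A", "B", "A", "B", "A", "B")

# cbind() returns a new data frame; the new column takes the vector's name
D2 <- cbind(D, Section)

cols_before <- ncol(D)    # D still has its original columns
cols_after <- ncol(D2)    # D2 has one more
print(names(D2))
```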

How to subset a data frame in R?

There are two subsetting techniques included in R to subset the data frames.

  1. [   Single bracket for subsetting.
  2. [[   Double bracket for subsetting.

Consider the data frame D with 4 columns and 6 rows that we created in the previous section.


Let us start by selecting single elements from the data frame.

Suppose you need to select the name of the student with Roll_No 5, in the fifth row of our data frame. We subset using a single bracket [. Write the name of the data frame (D) followed by a single bracket [ containing two arguments: first the observation (row) number, 5, and then the column number, 2.

Let us see the syntax with an example


<name of data frame>[<nrow>, <ncol>]

Example :


> D[5,2]
[1] "FRANKO"

We get the student name “FRANKO” by mentioning the row and column number from the data frame. Giving the column name instead of the column number also produces the same result as shown below


> D[5,"Student_Name"]
[1] "FRANKO"

To get an entire row or column from a data frame


D[<nrow>  , ]                    #to subset entire row
D[ ,  <ncol>]                    #to subset entire column

To retrieve or subset all information of “FRANKO” you can use the below code


> D[5,]
  Roll_No Student_Name Grade Section
5       5       FRANKO     C       A

The result is a data frame with a single observation.

Now, to get the Student_Name column, the syntax D[ , <ncol>] will work.


> D[,"Student_Name"]
[1] "ALEX"   "BOB"    "CARLES" "DANIEL" "FRANKO" "HENRY"

The result is a vector, because a single column contains elements of the same data type.

Subsetting multiple observations in the R data frame


> D[c(2,5), c("Student_Name","Grade")]
  Student_Name Grade
2          BOB     A
5       FRANKO     C

The row numbers (observations) are given first inside c(), and the column names are given next inside another c(), to get the values at the selected rows and columns.
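The double bracket [[ listed earlier is not demonstrated above, so here is a hedged sketch of how it compares with the single bracket. It rebuilds D with the Section column so that it runs on its own; the result names are illustrative.

```r
D <- data.frame(
  Roll_No = c(1, 2, 3, 4, 5, 6),
  Student_Name = c("ALEX", "BOB", "CARLES", "DANIEL", "FRANKO", "HENRY"),
  Grade = c("B", "A", "O", "A", "C", "A"),
  Section = c("A", "B", "A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Single bracket with one index keeps the data frame structure
grade_df <- D["Grade"]

# Double bracket extracts the column itself as a vector
grade_vec <- D[["Grade"]]

# drop = FALSE keeps a one-column result as a data frame
names_df <- D[, "Student_Name", drop = FALSE]

print(class(grade_df))
print(class(grade_vec))
print(class(names_df))
```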