String Manipulation Functions in R

String functions img

String Manipulations in R

In this section, we will discuss how to manipulate strings in the R programming language with different types of in-built functions provided by R packages.

A brief summary of function names and their corresponding actions or performance is described in the below table.

String Manipulations Functions Description
paste() To concatenate the strings.
format() To format the numerical values.
nchar() To count the characters in the string.
substr() To extract the specific characters from the string
grep() To extract a specific pattern from a string.
strsplit() To split the elements in a character string.
tolower() Translates lower to upper case strings
toupper() Translates upper to lower case strings

Let us go into detail with each function.

1.How to concatenating string : paste() function in R

The paste () function in R is used to combine two strings or more than two strings together to form a new string. In other words, the paste() function concatenates vectors after converting them into characters.

The paste function converts the arguments present in it to character strings and they are concatenated. If the arguments are vectors, they are concatenated term-by-term to give a character vector result. Vector arguments are recycled as needed, with zero-length arguments being recycled to " " only if recycle0 is not true or collapse is not NULL.

Syntax of paste() function in R


paste (..., sep = " ", collapse = NULL, recycle0 = FALSE)
paste0(..., collapse = NULL, recycle0 = FALSE)

Where,

  • … is one or more R objects, to be converted to character vectors. The R objects can be considered as string1, string2, string3, etc representing arguments to be combined.
  • sep is a character string to separate the arguments which are optional. Not NA_character.
  • collapse is an optional character string to eliminate the space in between two strings. Not NA_character. Not to eliminate the space between two words of a single string.
  • recycle0 is logical indicating if zero-length character arguments should lead to the zero-length character(0) after the sep-phase (which turns into " " in the collapse-phase, i.e., when the collapse is not NULL).
  • paste0(..., collapse) is equivalent to paste(..., sep = "", collapse), slightly more efficiently.

If a value is specified for collapse, the values in the result are then concatenated into a single string, with the elements being separated by the value of collapse.

Program using basic paste() function


#use paste() function
string1 = "Welcome"
string2 = "to"
string3 = "Learn etutorials" c string2, string3)
print(concatStr)

Output:


[1] "Welcome to Learn etutorials"

paste fn img

Program using paste() with sep


paste("Welcome", "to"," Learn etutorials ", sep = "----")    #paste with separator

The sep=”---“ is a character string that separates arguments welcome, to, Learn etutorials given inside a paste() function.

Output:


[1] "Welcome----to---- Learn etutorials "

When the separator is changed to sep=”___” the input and output becomes


> paste("Welcome", "to"," Learn etutorials ", sep = "___")    #paste with separator
[1] "Welcome___to___ Learn etutorials "

Program using collapse in paste() function


x <- c("example","of","paste","with","collapse")
print(x)
paste(x)

The vector object x contains several different character strings. When the paste function is applied over this vector object x, it returns the corresponding result the same as a vector output. Let us see the output to differentiate vector output(print x) and when paste() function is applied to vector x (paste(x))..


> print(x)
[1] "example"  "of"       "paste"    "with"     "collapse"
> paste(x)
[1] "example"  "of"       "paste"    "with"     "collapse"

You can observe that both results are the same. The character strings of vector object x are not merged with the paste function. So in order to merge elements of a vector object you need to specify the collapse argument.


paste(x,collapse = " ")

In the collapse argument, you need to specify another character string as a separator. Here collapse = “ “ implies the vector object x elements is going to merge with a blank value. When the code is run the output is


paste(x,collapse = " ")
[1] "example of paste with collapse"

Another character string is returned by merging all elements in the vector.

Difference between paste() and paste0()

  paste() paste0()
INPUT paste("Welcome", "to"," Learn etutorials ", sep = "") paste0("Welcome", "to"," Learn etutorials ")
OUTPUT  [1] "Welcome to Learn etutorials " [1] "Welcome to Learn etutorials "
INFERENCE Need to specify a separator ie sep. No need to specify a separator. By default uses an empty character string as a separator.

Inference: paste0() function is an alternative provided in the R programming language instead of the paste() function which is a more efficient and convenient function for merging strings. From the above table, it is clear that both paste() and paste0() provide similar output.

Note: the paste() coerces NA_character_, the character missing value to "NA" which may seem undesirable, e.g., when passing two character vectors, or very desirable, like in paste("the value of p is ", p).

2.nchar() function in R

The nchar() function in the R programming language counts the number of characters including spaces in a string. This function consists of a character vector as its arguments and returns a vector whose elements comprising of different sizes of the elements of a string. The nchar function in R is the fastest and most efficient way to find out if elements of a character vector are non-empty strings or not.

Syntax of nchar() function in R


nchar(x, type = "chars", allowNA = FALSE, keepNA = NA)
nzchar(x, keepNA = FALSE)

Where Arguments

  • x character vector, or a vector to be coerced to a character vector. Giving a factor is an error.
  • type character string: partial matching to one of c("bytes", "chars", "width").
  • allowNA logical: should NA be returned for invalid multibyte strings or "bytes"-encoded.
  • keepNA logical: should NA be returned when x is NA? If false, nchar() returns 2, as that is the number of printing characters used when strings are written to output, and nzchar() is TRUE. The default for nchar(), NA, means to use keepNA = TRUE unless type is "width".

Consider an example with a single character object or a variable string str0.To check how many number characters are contained in string str0 we apply nchar() function over str0. The function returns a corresponding value in the RStudio console. In our example applying nchar() in str0 returns a value of 27. Each character is counted including the space that separates two words within a given single string.

Program using nchar() function


# use nchar() function
#Returns the count of number of characters including space present in it

str0 = "welcome to Learn eTutorials"      #create a character object /string str0
print(nchar(str0))     #Apply nchar()  in R 

The output returns the character count value.

Output:


[1] 27

Now let us consider an example using vector datatype. A vector str1 is created using c() function with 4 elements welcome", "to", "Learn", "eTutorials" respectively. Applying nchar() over str1 returns the count of characters in each different word or string inside the given vector str1.For example the string “welcome” a word/element inside vector return a value 7 after application of nchar function and so on for remaining elements too.

Program using a vector data structure to use nchar()


str1=c("welcome","to","Learn","eTutorials")
print(str1)
nchar(str1)

Output:


> print(str1)
[1] "welcome"    "to"         "Learn"      "eTutorials"
> nchar(str1)
[1]  7  2  5 10

From the output, you can see how many characters correspond to each character string contained in the given vector.

How to perform nchar() with NA values?

In order to deal with NA values present within a given input, an optional argument keepNA provided in nchar() function.

Consider the code


vector0<- c(NA,"R", 'TUTORIAL', NULL)
nchar(vector0, keepNA = FALSE)

A vector named vector0 is created with a few elements inside it. Let us see first how the vector output gets displayed.


print(vector0)
[1] NA         "R"        "TUTORIAL"

The above strings are displayed in the R console after executing the shortcode. Now let us find what change does happens to the same code after applying nchar () and together with the addition of another optional argument 'keepNA'.


> nchar(vector0)
[1] NA  1  8

The NA value is excluded from counting the characters of each string of the given vector, i.e. vector0. The function counts the rest of the string's characters and returns the value as shown like “R” with 1 character, “TUTORIAL” with 8 characters, and so on.

When keepNA is set to TRUE, keepNA=TRUE, produces the same result as above.


nchar(vector0, keepNA = TRUE)
[1] NA  1  8

The NA is not counted, by changing the value from TRUE to FALSE, allows the nchar() function to count the NA if exists in the given input and returns its corresponding value.


> nchar(vector0, keepNA = FALSE)
[1] 2 1 8

The only difference between nchar() and nzchar() function is that nchar returns numeric values whereas nzchar() returns a logical value. Consider the nzchar() applied to the same vector0 created used in our examples, with an optional argument keepNA set to FALSE.


nzchar(vector0, keepNA = FALSE)

The output produced is


[1] TRUE TRUE TRUE

In case vectors contain any empty string represented as “ ” with a non-empty string, they will return a FALSE value.

Look at the table below

nchar() nzchar() Difference
vector0<- c(NA,"R", 'TUTORIAL', NULL) nchar(vector0, keepNA = FALSE) vector0<- c(NA,"R", 'TUTORIAL', NULL) nzchar(vector0, keepNA = FALSE) nchar() returns a numeric vector with same length as vector(vector0) as output.    
[1] 2 1 8 [1] TRUE TRUE TRUE  
nchar(vector0, keepNA = TRUE)   nzchar(vector0, keepNA = TRUE) nzchar() returns a logical vector with same length as vector(vector0) as output.    
[1] NA  1  8 [1]   NA TRUE TRUE  

 

3.format() function in R

The format() function in the R programming language deals with treating all vector elements as character strings by encoding the vector objects into a common format.

Syntax of format() function in R


format(x, trim = FALSE, digits = NULL, nsmall = 0L,
justify = c("left", "right", "centre", "none"),
width = NULL, na.encode = TRUE, scientific = NA,
big.mark   = "",   big.interval = 3L,
small.mark = "", small.interval = 5L,
decimal.mark = getOption("OutDec"),
zero.print = NULL, drop0trailing = FALSE, ...)

Where Arguments

  • x    any R object (conceptually); typically numeric.
  • trim logical; if FALSE, logical, numeric, and complex values are right-justified to a common width: if TRUE the leading blanks for justification are suppressed.
  • digits how many significant digits are to be used for numeric and complex x. The default, NULL, uses getOption("digits"). This is a suggestion: enough decimal places will be used so that the smallest (in magnitude) number has this many significant digits, and also to satisfy nsmall. (For the interpretation of complex numbers see signif.)
  • nsmall the minimum number of digits to the right of the decimal point in formatting real/complex numbers in non-scientific formats. Allowed values are 0 <= nsmall <= 20.
  • justify should a character vector be left-justified (the default), right-justified, centered, or left alone. Can be abbreviated.
  • width default method: the minimum field width or NULL or 0 for no restriction.
  • na.encode is logical: should NA strings be encoded? Note this only applies to elements of character vectors, not to numerical, complex nor logical NAs, which are always encoded as "NA".
  • scientific    Either a logical specifying whether elements of a real or complex vector should be encoded in scientific format or an integer penalty (see options("scipen")). Missing values correspond to the current default penalty.
  • ...    further arguments passed to or from other methods.

big.mark, big.interval, small.mark, small.interval, decimal.mark, zero.print, drop0trailing used for prettying (longish) numerical and complex sequences. Passed to prettyNum:

Example1: format() with arguments x,width,justify to format a string.


#use format() in R
# Place string to the left side
StrFormat1 <- format("Learn eTutorials", width = 25,  justify = "l")

# Place  string to the center
StrFormat2 <- format("Learn eTutorials", width = 25,  justify = "c")

# Place  string to the right
StrFormat3 <- format("Learn eTutorials", width = 25,  justify = "r")

# Display the different string placement
print(StrFormat1)
print(StrFormat2)
print(StrFormat3)

Output:


[1] "Learn eTutorials         "
[1] "    Learn eTutorials     "
[1] "         Learn eTutorials"

Example2: format() with arguments x(number),digits,nsmall,width,justify to format numbers.


# R program to illustrate format function

# Calling the format() function over  different arguments

# Rounding off the specified digits
numformat1 = format(1.45677, width = 10,digits=2)
numformat2 = format(1.45677,width = 10, digits=4)
numformat3 = format(1.45677,width = 10, justify = "r" ,digits=4)

print(numformat1)
print(numformat2)
print(numformat3)

# Getting the specified minimum number of digits
# to the right of the decimal point.
numformat3 = format(1.45677, nsmall=3)
numformat4 = format(1.45677, nsmall=7)
print(numformat3)
print(numformat4)

Output:


[1] "       1.5"
[1] "     1.457"
[1] "     1.457"
[1] "1.45677"
[1] "1.4567700"

The most useful arguments in format () to format a string is

  • width To produce minimum width.
  • trim  No padding with spaces when set to TRUE
  • justify  Takes the values "left", "right", "centre", and "none" to control the padding in strings.

The below arguments are useful for controlling the printing of numbers,

  • digits The number of digits to the right of the decimal place.
  • scientific use TRUE for scientific notation, FALSE for standard notation

4.substr() function in R

In R, the function substr()  extracts and returns part of a string from the whole given input string. For the process of extracting a part from a string, a start and stop integer needs to be considered. When the substr() is applied to a string, the extraction begins with the starting integer till it reaches the stop or ending integer. Once it reaches the referred stop integer the function returns the extracted substring.

Syntax


substr(x, start, stop)

Where Arguments

  • X   is a character vector or input string.
  • Start is an integer, represents the first element from where extraction begins
  • Stop is an integer, that represents the last element where extraction terminates.

Consider the example code


str = "hello Learn eTutorials learners"
substr(str,6,21)

The part of the string after extraction of string str from starting integer 6 and terminating integer 21 is


[1] " Learn eTutorial"

Similarly, a vector does perform in the same manner. Consider a vector str0 with a list of elements or strings like "hello"," Learn"," eTutorials"," learners".


str0=c("hello"," Learn"," eTutorials"," learners")
substr(str,6,21)

5.The grep()  function in R

The grep function or grep() in R facilitates the task of identifying or searching a specific pattern within a string. The grep function returns the number of instances of a searching pattern if they do find a match similar to the pattern in the string.

The grep() is a pattern matching and replacement function used in R. The grep, grepl, regexpr, gregexpr, regexec and gregexec search for matches to argument pattern within each element of a character vector:

sub and gsub perform replacement of the first and all matches respectively.

Syntax of grep() function in R


grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)

grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
      fixed = FALSE, useBytes = FALSE)

Where Arguments

  • pattern is a  character string that acts as a keyword to find a corresponding match among strings or given character vector.
  • x is the input character vector from which the pattern needs to be sought or found.
  • ignore.case is a boolean value set for the case sensitivity option while searching a pattern.
    ignore.case=TRUE Ignores case-sensitivity while pattern matching. Eg pattern =” Learn” Match with all other possible patterns irrespective of its case representation “learn”, “LEARN” etc.  
    ignore.case=FALSE Includes the case sensitivity while finding a match. Eg pattern =” Learn” Match only with “Learn” not with “learn”, “LEARN” etc. By default use FALSE against case sensitive option.
  • perl is a logical value either TRUE OR FALSE to identify whether Perl-compatible regexps be used or not.
  • value is logical to determine whether the output should return the position of the matching pattern.
  • fixed is logical.
    fixed=TRUE the pattern is a string to be matched as it is., which indicates pattern matching needs to be exact
    fixed=FALSE No restriction upon exact pattern match (default)
  • useBytes is logical to show that in the case of TRUE  the pattern is matched byte-by-byte else character-by-character.
    useBytes=TRUE byte-by-byte pattern matching
    useBytes=FALSE character-by-character matching
  • invert is logical to show that should output displays elements that do not match with the pattern or those match with the pattern.
    invert = TRUE If TRUE returns indices or values for elements that do not match.
    Invert = FALSE If FALSE returns indices or values for elements that do match.

Example of grep() function

Consider the code  below a vector named str0 with the following elements  "hello"," Learn"," eTutorials"," Learners"


str0=c("hello"," Learn"," eTutorials"," Learners")

Suppose you need to check a pattern “Learn” in the vector str0. You can use the grep() here.


grep("Learn",str0)

The output after applying to grep() in str0 is


[1] 2 4

The grep() searches pattern & returns a number of instances. For example, the search pattern “Learn” return the number of instances it occurs at 2, 4.

pattern img

Example of grep() using argument ignore.case

Consider the input character vector str0 with a list of character elements or strings.


str0=c("hello"," Learn"," eTutorials"," learners","LEARN","learN")

The table below shows use cases of ignore.case with grep()

grep("Learn",str0) [1] 2 By default ignore.case is FALSE ignores case sensitive strings while pattern matching.
grep("Learn",str0,ignore.case = FALSE) [1] 2 if FALSE, the pattern matching is case sensitive
grep("Learn",str0,ignore.case = TRUE) [1] 2 4 5 6 if TRUE, case is ignored during matching.
grep table

Example of grep() using arguments value,fixed,usebytes,invert


str0=c("hello"," Learn"," eTutorials"," learners","LEARN","learN")
> grep("Learn",str0,ignore.case = TRUE,value = TRUE,fixed = TRUE,useBytes = TRUE,invert  =TRUE)
[1] "hello"       " eTutorials" " learners"   "LEARN"       "learN"

Description against each argument in grep() for the above example

pattern "Learn"
x str0
ignore.case = TRUE Ignores case sensitivity and returns all matching patterns
value = TRUE Returns matching elements itself not indices of matched elements,
fixed = TRUE Exact match is returned
useBytes = TRUE Byte-by-byte matching
invert  =TRUE Returns values that do not match in output.

 

The output after applying to grep() in the str0 vector is


[1] "hello"       " eTutorials" " learners"   "LEARN"       "learN"

where these are patterns that do not match with the given pattern.

Consider what happens when all arguments in the above code are set to FALSE.


grep("Learn",str0,ignore.case = FALSE,value = FALSE,fixed = FALSE,useBytes = FALSE,invert = FALSE)
[1] 2

The value if FALSE, a vector containing the (integer) indices of the matches determined by grep() is returned. The pattern “Learn” matches exactly with indices 2 of the str0 vector.

Another simple example is to better understand the argument's value and invert. Returns non-matching elements as themselves ie as character vectors. The indices are not returned in these cases.
Try to understand each argument and observe the changes from below given code.


> str0=c("hello"," Learn"," eTutorials"," learners","LEARN","learN")    # vector str0
> str0
[1] "hello"       " Learn"      " eTutorials" " learners"   "LEARN"      
[6] "learN"      
> grep("Learn",str0)            # grep() to extract pattern
[1] 2
> grep("Learn",str0,invert = TRUE)           #Non matching elements are extracted using invert
[1] 1 3 4 5 6
> grep("Learn",str0,value = TRUE,invert = TRUE)        #value return character vector itself not indices.
[1] "hello"       " eTutorials" " learners"   "LEARN"       "learN"      
>

6.The strsplit() function in R

The strsplit() in R is a function to split the elements of a character vector. The strsplit() splits the given character vector(string) x into substrings based on the split argument provided within the syntax. The split argument indicates the character vector upon which the string is split into substrings.

Syntax of strsplit() function in R


strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)

Where the argument 

  • x is the input character string.
  • split is the character vector used for splitting the string.
  • fixed is logical. If TRUE match split exactly, otherwise use regular expressions
  • perl is logical to show whether Perl-compatible regexps be used?
  • useBytes is logical to indicate whether the pattern matching needs to be done byte-by-byte or character-by-character.
     

Consider the character variable or string str1 "hello Learn eTutorials learners" applying strsplit() with the split argument as “ ” returns a list of characters or elements of the string by splitting the string str1 at the blank spaces as "hello"      "Learn"      "eTutorials" "learners".


str1 = "hello Learn eTutorials learners"
> print(str1)
[1] "hello Learn eTutorials learners"
> strsplit(str1, " " )
[[1]]
[1] "hello"      "Learn"      "eTutorials" "learners"  

strsplit fn img

Consider another example to find the purpose of strsplit()


str1 = "hello Learn eTutorials learners"
print(str1)
strsplit(str1, "Learn eTutorials" )

The string is split at the position mentioned by argument split, here split is “Learn eTutorials”.Let us find its output


[1] "hello"   "learners"

strsplit fn img2

7.The tolower() and toupper() in R

The tolower() in R turns the character string into lowercase. The just opposite of tolower() is done by function toupper. The toupper() turns the character string to uppercase.

Description tolower() & toupper() translates characters in character vectors from upper to lower case and vice versa.
SYNTAX tolower(x)        toupper(x) where x is input character string.
Example   tolower(x)       x= "HELLO" > print(x) [1] "HELLO" > tolower(x) [1] "hello"
  toupper(x)       > x= "hello" > print(x) [1] "hello" > toupper(x) [1] "HELLO"