Learning R: String Manipulation

November 12, 2020 (last modified: November 12, 2020)

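This article is based on The Art of R Programming, with some modifications.

Although R is a statistical language built around numeric vectors and matrices, character strings play an important role as well.

Overview of String Manipulation Functions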

`grep()`

grep(pattern, x) searches the character vector x for pattern and returns the indices of the elements of x that contain a match.

grep(
  "Pole",
  c(
    "Equator",
    "North Pole",
    "South Pole"
  )
)

[1] 2 3

grep(
  "pole",
  c(
    "Equator",
    "North Pole",
    "South Pole"
  )
)

integer(0)
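The second call finds nothing because grep() is case-sensitive by default. The following is a small sketch of a few common variations; the vector name places is only for illustration. ignore.case = TRUE relaxes the match, value = TRUE returns the matching elements instead of their indices, and grepl() returns one logical value per element.

```r
places <- c("Equator", "North Pole", "South Pole")

# Case-insensitive search: "pole" now matches elements 2 and 3
idx <- grep("pole", places, ignore.case = TRUE)

# Return the matching strings themselves rather than their indices
hits <- grep("Pole", places, value = TRUE)

# One TRUE/FALSE per element, handy for subsetting
flags <- grepl("Pole", places)

print(idx)
print(hits)
print(flags)
```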

`nchar()`

nchar(x) returns the number of characters in the string x.

nchar("South Pole")

[1] 10

nchar(NA)

[1] NA

nchar(NULL)

integer(0)
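nchar() is also vectorized. As a brief sketch (the variable names are illustrative), it returns one length per element, and non-character input is coerced to character before counting:

```r
words <- c("Equator", "North Pole", "")

# One length per element; the empty string has length 0
lens <- nchar(words)

# A number is converted to its character form first
digits <- nchar(12345)

print(lens)
print(digits)
```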

`paste()`

paste() concatenates several strings into one.

By default, the pieces are joined with a single space.

paste("North", "Pole")

[1] "North Pole"

The sep argument specifies the separator.

paste("North", "Pole", sep="")

[1] "NorthPole"

paste("North", "Pole", sep=".")

[1] "North.Pole"

paste("North", "and", "South", "Poles")

[1] "North and South Poles"
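The calls above all combine single strings. As a short sketch of two further features of paste() (the variable names are illustrative): its arguments are recycled element-wise, and the collapse argument joins the resulting vector into one string. paste0() is a shorthand for sep = "".

```r
# Arguments are recycled element-wise, so one call builds several strings
labels <- paste("q", 1:3, sep = "")

# collapse joins the elements of the result into a single string
joined <- paste(c("North", "South"), "Pole", collapse = " and ")

# paste0() is shorthand for paste(..., sep = "")
short <- paste0("North", "Pole")

print(labels)
print(joined)
print(short)
```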

`sprintf()`

sprintf(...) assembles several components into a string according to a format specification.

i <- 8
s <- sprintf("the square of %d is %d", i, i^2)
s

[1] "the square of 8 is 64"
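Besides %d, a few other conversion specifications come up often. The sketch below (variable names are illustrative) shows %s for strings, %.2f for a fixed number of decimal places, and %03d for zero padding; sprintf() is vectorized over its arguments.

```r
# %s inserts a string, %d an integer
msg <- sprintf("%s has %d characters", "Pole", nchar("Pole"))

# %.2f keeps two digits after the decimal point
rounded <- sprintf("%.2f", pi)

# %03d pads with leading zeros to width 3; one result per input element
padded <- sprintf("%03d", c(7L, 42L))

print(msg)
print(rounded)
print(padded)
```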

`substr()`

substr(x, start, stop) returns the substring of x from position start through position stop.

substr("Equator", 3, 5)

[1] "uat"
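Two related features are worth a quick sketch (variable names are illustrative): substr() can appear on the left-hand side of an assignment to overwrite part of a string, and the closely related substring() accepts vectors of start and stop positions.

```r
x <- "Equator"

# Replace the first character in place
substr(x, 1, 1) <- "e"

# substring() is vectorized over the start and stop positions
pieces <- substring("Equator", 1:3, 3:5)

print(x)
print(pieces)
```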

`strsplit()`

strsplit(x, split) splits the string x at every occurrence of split and returns the pieces in a list.

strsplit("6-16-2011", split="-")

[[1]]
[1] "6"    "16"   "2011"
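The result is a list because strsplit() accepts a vector of strings and returns one element per input string. A minimal sketch of working with that list (variable names are illustrative):

```r
# [[1]] picks the character vector for the first (and only) input string
parts <- strsplit("6-16-2011", split = "-")[[1]]

# Convert the pieces to integers for further computation
nums <- as.integer(parts)

# With several input strings, the result has one list element per string
many <- strsplit(c("6-16-2011", "11-12-2020"), split = "-")

print(nums)
print(length(many))
```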

`regexpr()`

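regexpr(pattern, text) searches text for pattern and returns the starting position of the first match; the length of the match is stored in the match.length attribute.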

a <- regexpr("uat", "Equator")
a

[1] 3
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

attr(a, "match.length")

[1] 3
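The start position and match.length together are enough to cut the match out of the string. The sketch below (variable names are illustrative) does this by hand with substr() and then with the base helper regmatches(); it also shows that a failed search reports position -1.

```r
x <- "Equator"
m <- regexpr("uat", x)

# Cut the match out by hand from its start and length
start <- as.integer(m)
len <- attr(m, "match.length")
by_hand <- substr(x, start, start + len - 1)

# regmatches() does the same job in one call
by_helper <- regmatches(x, m)

# When nothing matches, the reported position is -1
miss <- regexpr("xyz", x)

print(by_hand)
print(by_helper)
print(as.integer(miss))
```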

`gregexpr()`

gregexpr(pattern, text) works like regexpr() but returns the starting positions of all matches. The result is a list with one element per string in text.

b <- gregexpr("iss", "Mississippi")
b

[[1]]
[1] 2 5
attr(,"match.length")
[1] 3 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

attr(b[[1]], "index.type")

[1] "chars"
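As a short sketch of how the result is typically used (variable names are illustrative), the positions can be read from the first list element, and regmatches() extracts the matched substrings themselves:

```r
x <- "Mississippi"
b <- gregexpr("iss", x)

# Starting positions of every match in the first input string
starts <- as.integer(b[[1]])

# regmatches() returns a list; [[1]] holds the matches for x
found <- regmatches(x, b)[[1]]

# Number of matches
n <- length(found)

print(starts)
print(found)
print(n)
```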

Regular Expressions

Functions that accept regular expressions include:

grep()
grepl()
regexpr()
gregexpr()
sub()
gsub()
strsplit()

grep(
  "[au]",
  c(
    "Equator",
    "North Pole",
    "South Pole"
  )
)

[1] 1 3

grep(
  "o.e",
  c(
    "Equator",
    "North Pole",
    "South Pole"
  )
)

[1] 2 3

grep(
  "N..t",
  c(
    "Equator",
    "North Pole",
    "South Pole"
  )
)

[1] 2

grep(
  ".",
  c(
    "abc",
    "de",
    "f.g"
  )
)

[1] 1 2 3

grep(
  "\\.",
  c(
    "abc",
    "de",
    "f.g"
  )
)

[1] 3
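The examples above only use grep(). As a brief sketch of some of the other functions in the list (variable names are illustrative): sub() replaces the first match in each string, gsub() replaces every match, and grepl() combined with the ^ anchor tests how a string begins.

```r
places <- c("Equator", "North Pole", "South Pole")

# sub() replaces only the first match in each string
first <- sub("o", "0", places)

# gsub() replaces every match
all_o <- gsub("o", "0", places)

# ^ anchors the pattern to the start of the string
starts_s <- grepl("^S", places)

print(first)
print(all_o)
print(starts_s)
```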

Extended Example: Testing a File Name for a Given Suffix

test_suffix <- function(fn, suff) {
  parts <- strsplit(
    fn, ".",
    fixed=TRUE
  )
  nparts <- length(parts[[1]])
  return(parts[[1]][nparts] == suff)
}

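Note: fixed=TRUE makes strsplit() treat the separator as a literal string rather than a regular expression. Alternatively, the period can be written as \\. in a regular expression.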
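To see why this matters, here is a small sketch comparing the three ways of passing the separator (variable names are illustrative). Without fixed=TRUE, the period is a regular expression that matches every character, so nothing but empty strings remains.

```r
# As a regular expression, "." matches every character,
# so only empty strings are left between the separators
as_regex <- strsplit("a.b", ".")[[1]]

# Literal period via fixed = TRUE
as_fixed <- strsplit("a.b", ".", fixed = TRUE)[[1]]

# Literal period via an escaped regular expression
as_escaped <- strsplit("a.b", "\\.")[[1]]

print(as_regex)
print(as_fixed)
print(as_escaped)
```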

Tests

test_suffix("x.abc", "abc")

[1] TRUE

test_suffix("x.abc", "ac")

[1] FALSE

test_suffix("x.y.abc", "ac")

[1] FALSE

test_suffix("x.y.abc", "abc")

[1] TRUE

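test_suffix_v2 <- function(fn, suff) {
  ncf <- nchar(fn)
  # position where the period must sit if suff is the suffix of fn
  dotpos <- ncf - nchar(suff)
  # compare the tail of fn, period included, with "." followed by suff
  return(substr(fn, dotpos, ncf) == paste0(".", suff))
}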

test_suffix_v2("x.y.abc", "abc")

[1] TRUE

test_suffix_v2("x.ac", "abc")

[1] FALSE
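Base R also ships helpers that cover the same task. The following sketch is an alternative to the hand-written functions above, not part of the original example: endsWith() (available since R 3.3.0) compares the tail of a string, tools::file_ext() extracts the extension, and an anchored regular expression works as well.

```r
fn <- "x.y.abc"

# endsWith() compares the tail of the string; the period is included here
has_suffix <- endsWith(fn, ".abc")

# tools::file_ext() extracts the extension after the last period
ext <- tools::file_ext(fn)

# $ anchors the pattern to the end of the string
by_regex <- grepl("\\.abc$", fn)

print(has_suffix)
print(ext)
print(by_regex)
```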

Extended Example: Forming File Names

for (i in 1:5) {
  fname <- paste("q", i, ".pdf", sep="")
  pdf(fname)
  hist(rnorm(100, sd=i))
  dev.off()
}

for (i in 1:5) {
  fname <- sprintf("q%d.pdf", i)
  pdf(fname)
  hist(rnorm(100, sd=i))
  dev.off()
}
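Both loops build the same five file names, one per iteration. Because paste() and sprintf() are vectorized, the names can also be generated in a single call, as in this sketch (variable names are illustrative). The zero-padded variant is a common choice when the files should sort correctly by name.

```r
# Both functions are vectorized, so all names can be built at once
by_paste <- paste("q", 1:5, ".pdf", sep = "")
by_sprintf <- sprintf("q%d.pdf", 1:5)

# Zero-padded numbers keep the files in order when sorted by name
padded <- sprintf("q%02d.pdf", c(1L, 10L))

print(by_paste)
print(by_sprintf)
print(padded)
```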

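References

Other articles in the Learning R series:

"Quick Start"

"Vectors"

"Matrices and Arrays"

"Lists"

"Data Frames"

"Factors and Tables"

"Programming Structures"

"Math and Simulations"

"Object-Oriented Programming"

"Input and Output"

The code for this article is available in the following project: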

https://github.com/perillaroc/the-art-of-r-programming