
I have a large data set with thousands of columns. The column names include various unwanted characters as follows:

col1_3x_xxx
col2_3y_xyz
col3_3z_zyx

I would like to remove everything from "_3" onward in every column name, leaving clean names:

col1
col2
col3

What is the most efficient way to do this for 5000+ columns?

Gregor Thomas
pyne
  • `names(your_data) = gsub(pattern = "_3*", replacement = "", x = names(your_data))` – Gregor Thomas Jun 13 '16 at 23:29
  • Also, please don't use the RStudio tag unless your question concerns RStudio. (You wouldn't use a Microsoft Word tag for a grammar question just because you're using Word to write something.) – Gregor Thomas Jun 13 '16 at 23:31
  • You can also use: ``sapply(strsplit(names(df), "_3"), `[[`, 1)``. – JasonWang Jun 13 '16 at 23:33
  • @Gregor I think you need a `.` in `"_3.*"`, or else you're looking for a 3 repeated 0 or more times. @pynewbie, if you're looking for efficiency with 5000+ columns, there are _milliseconds_ to be gained by adding `perl = TRUE`. – Jota Jun 14 '16 at 00:21
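
To illustrate the point in the last comment, here is a minimal sketch using the three example names from the question. The vector name `cols` and the result names `wrong` and `right` are made up for illustration.

```r
# Example names taken from the question; `cols` is a hypothetical variable.
cols <- c("col1_3x_xxx", "col2_3y_xyz", "col3_3z_zyx")

# "_3*" matches an underscore followed by zero or more 3s, so gsub deletes
# every underscore (plus any 3s right after it) and keeps the rest.
wrong <- gsub("_3*", "", cols)

# "_3.*" matches "_3" and everything after it, up to the end of the string.
right <- sub("_3.*", "", cols, perl = TRUE)

print(wrong)
print(right)
```

With the first pattern the suffix characters survive (for example `col1xxxx`); only the second pattern leaves `col1`, `col2`, `col3`.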

4 Answers


This answer is certainly late, but in case someone is still looking for a solution, here is how to rename a single column, where col is that column's index:

colnames(df1)[col] <-  sub("_3.*", "", colnames(df1)[col])

And if you have multiple columns:

# Loop over column indices; seq_len() is safe even if df1 has no columns
for (col in seq_len(ncol(df1))) {
    colnames(df1)[col] <- sub("_3.*", "", colnames(df1)[col])
}
Rene Chan

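We can use sub. The call below operates on the values in the first column of df1; to clean the column names themselves, pass names(df1) instead and assign the result back to names(df1).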

sub("_3.*", "", df1[,1])
#[1] "col1" "col2" "col3"
akrun

We can try str_extract with the regular expression pattern "^[^_]+(?=_)":

stringr::str_extract(c("col1_3x_xxx", "col2_3y_xyz", "col3_3z_zyx"), "^[^_]+(?=_)")
[1] "col1" "col2" "col3"

In the pattern, the first ^ anchors the match at the beginning of the string. [^_]+ matches one or more characters other than _ (inside the brackets, ^_ means any character but _). (?=_) is a lookahead: it requires the match to be followed by _ without including that _ in the result.
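
As a rough sketch of what the lookahead changes, the same pattern can be tried in base R with `regexpr(..., perl = TRUE)` and `regmatches`, so stringr is not needed. The extra name `"plain"` and the variable names are made up for illustration.

```r
# Example names from the question plus one hypothetical name without "_".
cols <- c("col1_3x_xxx", "col2_3y_xyz", "col3_3z_zyx", "plain")

# With the lookahead, a name only matches if an underscore follows the prefix.
# regmatches() drops elements that have no match, so "plain" disappears.
with_lookahead <- regmatches(cols, regexpr("^[^_]+(?=_)", cols, perl = TRUE))

# Without the lookahead, the prefix before the first underscore (or the
# whole name, if there is no underscore) is returned for every element.
without_lookahead <- regmatches(cols, regexpr("^[^_]+", cols, perl = TRUE))

print(with_lookahead)
print(without_lookahead)
```

Note that both variants cut at the first underscore rather than at "_3", so they only agree with `sub("_3.*", "", cols)` when the part to keep contains no underscore of its own.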

Psidom
  • No need for the non-capture group: `str_extract(cols, "^[^_]+")`. Pretty much the same speed as Gregor's suggestion: `sub(pattern = "_3.*", replacement = "", x = cols, perl = TRUE)` – Jota Jun 14 '16 at 03:16

You can use

names(df) = gsub(pattern = "_3.*", replacement = "", x = names(df))
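
For a quick end-to-end check, here is a minimal sketch on a small made-up data frame (the data and the name `new_names` are assumptions for illustration). Because `sub` is vectorized, one call handles all columns with no loop. Truncating names can produce duplicates, so the sketch checks for that before assigning.

```r
# Small hypothetical data frame with the column names from the question.
df <- data.frame(col1_3x_xxx = 1:2, col2_3y_xyz = 3:4, col3_3z_zyx = 5:6)

# One vectorized call cleans every name at once.
new_names <- sub("_3.*", "", names(df))

# Stop early if stripping the suffix made two columns share a name.
stopifnot(anyDuplicated(new_names) == 0)

names(df) <- new_names
print(names(df))
```

The same three lines should work unchanged on a data frame with 5000+ columns, since the cost is a single pass over the character vector of names.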