r Remove parts of column name after certain characters

Question

I have a large data set with thousands of columns. The column names include various unwanted characters as follows:

col1_3x_xxx
col2_3y_xyz
col3_3z_zyx

I would like to remove everything from "_3" onward in every column name, leaving clean names:

col1
col2
col3

What is the most efficient way to do this for 5000+ columns?

`names(your_data) = gsub(pattern = "_3*", replacement = "", x = names(your_data))` — Gregor Thomas, Jun 13 '16 at 23:29
Also, please don't use the RStudio tag unless your question concerns RStudio. (You wouldn't use a Microsoft Word tag for a grammar question just because you're using Word to write something.) — Gregor Thomas, Jun 13 '16 at 23:31
You can also use: `sapply(strsplit(names(df), "_3"), \`[[\`, 1)`. — JasonWang, Jun 13 '16 at 23:33
@Gregor I think you need a `.` in `"_3.*"` or else you're looking for a 3 repeated 0 or more times, and @pynewbie if you're looking for efficiency with 5000+ columns, there are _milliseconds_ to be gained by adding `perl = TRUE` — Jota, Jun 14 '16 at 00:21
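
To make the difference between the two patterns in the comments concrete, here is a minimal sketch using the three example names from the question. The variable names `cols`, `wrong` and `right` are made up for illustration.

```r
cols <- c("col1_3x_xxx", "col2_3y_xyz", "col3_3z_zyx")

# "_3*" means an underscore followed by zero or more 3s, so every
# underscore matches and only "_" or "_3" is deleted
wrong <- gsub("_3*", "", cols)

# "_3.*" means "_3" followed by anything up to the end of the string
right <- sub("_3.*", "", cols, perl = TRUE)

print(wrong)  # "col1xxxx" "col2yxyz" "col3zzyx"
print(right)  # "col1" "col2" "col3"
```

Since `.*` already runs to the end of the string, there is at most one match per name, so `sub` and `gsub` should give the same result with the corrected pattern.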

score 19 · Answer 1 · answered Nov 12 '19 at 01:39

Certainly late with this answer, but in case someone is still looking for a solution, for a single column at index col:

colnames(df1)[col] <- sub("_3.*", "", colnames(df1)[col])

And if you have multiple columns:

for (col in seq_len(ncol(df1))) {
    colnames(df1)[col] <- sub("_3.*", "", colnames(df1)[col])
}
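
Because `sub` is vectorized, the loop is not strictly needed. The following is a sketch of a one-call alternative, not part of the original answer; the small `df1` is a hypothetical stand-in for the real data, and the duplicate check is an added precaution in case two columns collapse to the same cleaned name.

```r
# Hypothetical data frame standing in for the real data
df1 <- data.frame(col1_3x_xxx = 1:2, col2_3y_xyz = 3:4, col3_3z_zyx = 5:6)

# sub() is vectorized over its input, so all names are cleaned in one call
new_names <- sub("_3.*", "", colnames(df1))

# Guard against two columns collapsing to the same cleaned name
if (anyDuplicated(new_names) > 0) {
  new_names <- make.unique(new_names)  # appends .1, .2, ... to repeats
}

colnames(df1) <- new_names
print(colnames(df1))
```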

score 18 · Accepted Answer · answered Jun 14 '16 at 02:13

We can use sub

sub("_3.*", "", df1[,1])
#[1] "col1" "col2" "col3"
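
The call above cleans the values in the first column of `df1`, which presumably assumes the raw names are stored there as data. A minimal sketch of both readings follows; the data frames `df1` and `df2` and the column name `raw` are hypothetical.

```r
# Reading 1: the raw names are stored as values in the first column
df1 <- data.frame(raw = c("col1_3x_xxx", "col2_3y_xyz", "col3_3z_zyx"),
                  stringsAsFactors = FALSE)
cleaned_values <- sub("_3.*", "", df1[, 1])

# Reading 2: the suffixes are part of the column names themselves
df2 <- data.frame(col1_3x_xxx = 1, col2_3y_xyz = 2, col3_3z_zyx = 3)
colnames(df2) <- sub("_3.*", "", colnames(df2))

print(cleaned_values)
print(colnames(df2))
```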

akrun

I had success by replacing the object "df1[,1]" with "colnames(df1)". – Todd D Aug 15 '18 at 18:00
@ToddD. Good to know. I think the OP wanted to do that after we get the substring – akrun Aug 16 '18 at 03:58
is there a possible `dplyr` version of this? – dre Jan 05 '19 at 03:26
@dre Use `str_remove` i.e. `df1 %>% mutate(col = str_remove(col, '_3.*'))` – akrun Jan 05 '19 at 07:25

score 5 · Answer 3 · answered Jun 14 '16 at 00:36

We can try str_extract with the regular expression pattern "^[^_]+(?=_)":

stringr::str_extract(c("col1_3x_xxx", "col2_3y_xyz", "col3_3z_zyx"), "^[^_]+(?=_)")
[1] "col1" "col2" "col3"

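In the pattern:

The first ^ anchors the match at the beginning of the string. [^_]+ matches one or more characters other than _ (inside the brackets, a leading ^ negates the class). (?=_) is a lookahead: it requires that a _ follows, without including it in the match. Note that this keeps everything before the first underscore, so it relies on the part to keep containing no _.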

Psidom

No need for the lookahead: `str_extract(cols, "^[^_]+")`. Pretty much the same speed as Gregor's suggestion: `sub(pattern = "_3.*", replacement = "", x = cols, perl = TRUE)` – Jota Jun 14 '16 at 03:16
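
The two approaches agree on the names in the question but can differ when the part to keep contains an underscore itself. A small sketch of that difference follows. The name "my_col_3x_xxx" is a made-up example, and base R's `regmatches`/`regexpr` stand in for `stringr::str_extract` here so that no package is needed.

```r
x <- c("col1_3x_xxx", "my_col_3x_xxx")  # second name is hypothetical

# Keep everything before the first underscore
first_field <- regmatches(x, regexpr("^[^_]+", x))

# Keep everything before the first "_3"
before_3 <- sub("_3.*", "", x)

print(first_field)  # "col1" "my"
print(before_3)     # "col1" "my_col"
```

Which one is appropriate depends on whether "_3" or the first underscore is the real delimiter in the data.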

score 2 · Answer 4 · answered Apr 29 '22 at 05:15 · edited Jan 14 '23 at 04:35

You can use

names(df) = gsub(pattern = "_3.*", replacement = "", x = names(df))