1

I have a question similar to the one asked here: r Remove parts of column name after certain characters, but with a slight wrinkle. My column titles have formats such as ENSG00000124564.16 and ENSG00000257509.1, and I want to remove all characters after the .

I cannot just remove the last x characters, as the number of characters after the . symbol varies between column titles.

If I follow the sub() command in the previous question, like here: sub(".*", "", colnames(dataset[6:ncol(dataset)])), it does nothing. I assume this is because in the normal command the . symbol is used to separate the string you are searching for, and the * symbol represents anything after it.

How do I alter the code to use . as the string search symbol? This is probably a very simple question.

Phil D
  • 1
    `.` in regex is a special character that means any character. If you want to match it literally, you can escape it with `\\.` or `[.]`, or set `fixed = TRUE` to not use regex – camille Jan 16 '20 at 16:32
  • 1
    Does this answer your question? [R grep to match dot](https://stackoverflow.com/questions/32916884/r-grep-to-match-dot) – camille Jan 16 '20 at 16:36
  • Also I think you're misunderstanding the regex you're using: `*` matches zero or more occurrences of a character/pattern. So `"a*"` would match "abc", "aaa", and "xyz", because you're not requiring that `"a"` actually be there in order to match. Maybe you're confusing regex with globbing? – camille Jan 16 '20 at 16:39
  • Thanks, I had tried to use `[]` to separate out the `.` symbol, but it still wouldn't work. I was committed to modifying the column titles directly before, as I outlined in my response to sm925 below, but it wouldn't work. I managed to get it working by creating a list of my column names first, then modifying them using `sub()` or `gsub()`, and then replacing the column titles with the modified list. I still don't understand why it wouldn't work directly. – Phil D Jan 16 '20 at 16:50
  • Maybe if you included your code & data (see [reproducible example guidance](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)) it would be more clear what isn't working—I'm not sure what you mean by modifying directly. Calling `names` returns a vector; to change those names you need to assign a vector back to `names` – camille Jan 16 '20 at 17:54
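
The options mentioned in the comments above can be compared side by side. This is a minimal sketch using the two IDs from the question; the variable names are illustrative only. Because `fixed = TRUE` turns off regex entirely, a pattern like `.*` is no longer available, so the sketch assumes splitting on the literal dot instead.

```r
x <- c("ENSG00000124564.16", "ENSG00000257509.1")

# Unescaped: "." matches any character, so ".*" consumes the whole string
unescaped <- sub(".*", "", x)

# Escaped dot: remove the literal "." and everything after it
escaped <- sub("\\..*", "", x)

# Dot inside a character class is also literal
bracketed <- sub("[.].*", "", x)

# fixed = TRUE disables regex: split on the literal "." and keep the first piece
split_first <- vapply(strsplit(x, ".", fixed = TRUE), `[`, character(1), 1)

print(unescaped)
print(escaped)
```

The unescaped pattern returns empty strings, while the other three approaches all return the IDs without their version suffix.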

2 Answers

5

You can escape the period with \\.:

x <- "ENSG00000124564.16"
sub("\\..*", "", x)
#[1] "ENSG00000124564"

update:

## it also works on a vector of strings
x <- c("ENSG00000124564.16",  "ENSG00000257509.1")
sub("\\..*", "", x)
# [1] "ENSG00000124564" "ENSG00000257509"

## it also works for changing the column names of a data frame
df <- data.frame(ENSG00000124564.16 = c(1, 2, 3), ENSG00000257509.1 = c(1, 1, 1))
names(df) <- sub("\\..*", "", names(df))
df
#  ENSG00000124564 ENSG00000257509
#1               1               1
#2               2               1
#3               3               1
sm925
  • This worked when I tried it on a single string `x <- "ENSG00000124564.16"`, and when I tried it on a list of strings `x <- colnames(dataset)`, but it wouldn't work when I tried to change the column names directly with `colnames(dataset) <- gsub("\\..*$", "", colnames(dataset))`. The problem is fixed now, as I could just use the list of strings to replace my column names, but I'm still unsure why it wouldn't work directly. – Phil D Jan 16 '20 at 16:47
  • @PhilD I have updated the answer. It's working for me. You can add a reproducible example of your sample data set so I can figure out why it's not working in that case. – sm925 Jan 16 '20 at 17:04
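
The thread never establishes why the direct assignment failed for the asker. For reference, here is a minimal sketch of renaming only a range of columns, as the question's `6:ncol(dataset)` suggests. The data frame below is hypothetical (two metadata columns followed by the gene IDs), so the range starts at column 3 rather than 6. The point it illustrates is that `colnames()` returns a vector, and the cleaned names only take effect once they are assigned back.

```r
# Hypothetical data: two metadata columns followed by versioned gene IDs
dataset <- data.frame(
  sample = c("a", "b"),
  batch = c(1, 2),
  ENSG00000124564.16 = c(1, 2),
  ENSG00000257509.1 = c(3, 4)
)

idx <- 3:ncol(dataset)  # columns to rename (6:ncol(dataset) in the question)

# Calling sub() on its own only returns a modified copy; dataset is unchanged
unchanged <- colnames(dataset)
sub("\\..*", "", colnames(dataset)[idx])
stopifnot(identical(colnames(dataset), unchanged))

# Assigning the result back is what actually renames the columns
colnames(dataset)[idx] <- sub("\\..*", "", colnames(dataset)[idx])
print(colnames(dataset))
```

Note that the index goes outside the call, `colnames(dataset)[idx]`, so the assignment targets the names of the original data frame rather than those of a temporary subset.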
3

With \\. you indicate a literal dot. A bare . matches any single character, and .* matches any character zero or more times. With $ you indicate the end of the string. So you can put those together as such:

df <- data.frame(ENSG00000124564.16=c(1,2,3), ENSG00000257509.1=c(4,5,6))
df

colnames(df) <- gsub("\\..*$", "", colnames(df))
df
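
To see how the pieces of the pattern interact, here is a small sketch. The third ID, "A.1.2", is a made-up value added only to show what happens when a name contains more than one dot; it does not come from the question.

```r
ids <- c("ENSG00000124564.16", "ENSG00000257509.1", "A.1.2")

# "\\." is a literal dot; ".*" then consumes the rest of the string,
# so everything from the FIRST dot onward is removed
first_dot <- sub("\\..*", "", ids)

# The trailing "$" is redundant here: ".*" is greedy and already runs to the end
with_anchor <- sub("\\..*$", "", ids)

# To drop only the LAST dot-separated suffix, exclude dots from the tail
last_dot <- sub("\\.[^.]*$", "", ids)

print(first_dot)
print(last_dot)
```

For the two IDs in the question all three patterns give the same result, and `sub()` and `gsub()` behave identically because the match always extends to the end of the string.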

edit: sm925 was too fast for my slow typing :)

NicolasH2