
I have a large dataset and I would like to remove the trailing suffix (the letter e, v, or i followed by digits) from the end of each string. My dataset looks like this:

P*01:01:05e1
P*01:01:05e2
P*01:01:05e3
P*01:01:05e10
P*02:02v1
P*02:02v2
P*02:01:03v2
P*05:01:01i1
P*05:01:01i8

and I want it to be P*01:01:05, P*02:02, P*02:01:03, P*05:01:01. I first tried removing the 'e' letters using

> xdata$gene <-gsub("e*", "", xdata$gene, perl = TRUE) 

but I get this error message

Error in `$<-.data.frame`(`*tmp*`, "gene", value = character(0)) : 
  replacement has 0 rows, data has 58

It appears I cannot replace 'e' with nothing. Any suggestions?

Data

xdata <- read.table(header = TRUE, stringsAsFactors = FALSE,
                    text = "gene
                    P*01:01:05e1
                    P*01:01:05e2
                    P*01:01:05e3
                    P*01:01:05e10
                    P*02:02v1
                    P*02:02v2
                    P*02:01:03v2
                    P*05:01:01i1
                    P*05:01:01i8")
rawr
Mona
  • Try `stringr::str_split_fixed(df1$V1, pattern = "e|v|i", n = 2)` – zx8754 Nov 18 '16 at 21:05
  • What about: `strings <- c("P*01:01:05e1", "P*02:01:03v2")` `strings <- chartr("evi", " ", strings)` `gsub(" ", "", strings)` `[1] "P*01:01:051" "P*02:01:032"` – William Nov 18 '16 at 21:08
  • @zx8754 OP wants to remove not split – Sotos Nov 18 '16 at 21:11
  • @Sotos split then get 1st column? I will leave to community if this needs re-opening. `stringr::str_split_fixed(df1$V1,pattern = "e|v|i", n = 2)[, 1]` – zx8754 Nov 18 '16 at 21:14
  • Yeah, I guess that's one way of doing it. So many dupes for these kinds of questions – Sotos Nov 18 '16 at 21:16
  • @Sotos Exactly my point, many many dupes, agreed target is not 100% dupe, but gives enough knowledge to go towards the right solution. – zx8754 Nov 18 '16 at 21:17
  • no one really addressed the error... @Mona I feel like you are misspelling the column name in your `gsub`, for example I get that error if I use `xdata$gene <-gsub("e*", "", xdata$dasdfaldfalasdfasd)` so for the example data, your code runs without error, but as pointed out you probably want `gsub('[evi].*', '', xdata$gene)` instead – rawr Nov 18 '16 at 22:10
  • I spotted the error and edited the formula and it worked. FYI the formula is: `data$B_newY <- gsub("([evi]\\d+)", "", data$B_old, perl = TRUE)` – Mona Nov 18 '16 at 22:45
  • Also, I have 10 columns of data but only want to apply the formula to 9 of them. Any suggestions? – Mona Nov 18 '16 at 22:52
  • `sub_fun <- function(x) gsub("[eiv].*", "", x); data[, -1] <- lapply(data[, -1], sub_fun)` should work – rawr Nov 18 '16 at 23:32
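
To pull the comments above together, here is a minimal sketch in base R. It makes a few assumptions that are not in the original post: `xdata$genes` is a deliberately misspelled, hypothetical column name used only to reproduce the error; the trailing `$` anchor in the pattern is an addition so that only a suffix at the very end of the string is removed; and `dat` is a made-up data frame standing in for the 10-column data, with the first column left untouched.

```r
# Sample data from the question
xdata <- data.frame(
  gene = c("P*01:01:05e1", "P*01:01:05e2", "P*01:01:05e3", "P*01:01:05e10",
           "P*02:02v1", "P*02:02v2", "P*02:01:03v2",
           "P*05:01:01i1", "P*05:01:01i8"),
  stringsAsFactors = FALSE
)

# 1. Likely cause of the error: a column name that does not exist gives NULL,
#    and gsub() on NULL returns character(0). Assigning that back into the
#    data frame raises "replacement has 0 rows, data has ...".
empty <- gsub("e*", "", xdata$genes)   # `genes` is a hypothetical typo
length(empty)
# [1] 0

# 2. Remove e, v, or i followed by one or more digits at the end of the string.
#    The `$` anchor is an assumption added here; the thread used "[evi].*"
#    and "([evi]\\d+)" without it.
xdata$gene_clean <- sub("[evi]\\d+$", "", xdata$gene)
unique(xdata$gene_clean)
# [1] "P*01:01:05" "P*02:02"    "P*02:01:03" "P*05:01:01"

# 3. Apply the same cleanup to every column except the first.
#    `dat` is a hypothetical stand-in for the real data.
dat <- data.frame(
  id = 1:2,
  a  = c("P*01:01:05e1", "P*02:02v2"),
  b  = c("P*05:01:01i8", "P*02:01:03v2"),
  stringsAsFactors = FALSE
)
strip_suffix <- function(x) sub("[evi]\\d+$", "", x)
dat[-1] <- lapply(dat[-1], strip_suffix)
```

Using `dat[-1]` rather than `dat[, -1]` keeps the selection a data frame even if only one column remains after dropping the first, so the `lapply()` assignment behaves the same way in both cases.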
