Extract the first part of each string in a data frame in R

Question

I have a data frame M. I would like to extract the first part of each string, i.e. everything before the first ":". I used strsplit, but the result is a large character vector, not a data frame. Could someone please help with this?

M <- read.table(text=
"1/1:205,54,0:18:0:57 1/1:141,39,0:13:0:42   0/0:0,54,255:18:0:45 1/1:174,48,0:16:0:51 0/0:0,84,255:28:0:75 
 0/0:0,78,255:26:0:99 0/0:0,63,255:21:0:86   0/0:0,45,255:15:0:68 0/0:0,48,255:16:0:71 0/0:0,132,255:44:0:99
 0/0:0,78,255:26:0:89 0/0:0,78,255:26:0:89   0/0:0,36,255:12:0:47 0/0:0,33,255:11:0:44 0/0:0,108,255:36:0:99
 0/0:0,75,255:25:0:99 0/0:0,54,255:18:0:78   0/0:0,69,255:23:0:93 0/0:0,33,255:11:0:57 0/0:0,96,255:32:0:99 
 0/0:0,60,75:21:0:74  0/0:0,51,84:17:0:65    0/0:0,48,64:17:0:62  0/0:0,42,65:15:0:56  0/0:0,84,99:28:0:98 ",
head=F, stringsAsFactors=F)
S <- sapply(strsplit(M, ":"), "[", 1)

`sapply(M, function(x) sapply(strsplit(as.character(x), ':'),'[',1))` — akrun, Jul 27 '15 at 18:02
You should do an SO search on: `[r] extract the first part of each string`. I get 15 hits. One of those is almost surely a duplicate. — IRTFM, Jul 27 '15 at 18:07
Also related: [Splitting a dataframe string column into multiple different columns](http://stackoverflow.com/questions/18641951/splitting-a-dataframe-string-column-into-multiple-different-columns) and e.g. `splitstackshape::cSplit` — smci, Jul 27 '15 at 18:12

akrun · Accepted Answer · 2015-07-27T19:03:40.023

5

strsplit may not be the best tool here, as we are only interested in a substring. Still, assuming that the OP wants to understand how strsplit can be used for this example dataset, one modification of the OP's code is a nested lapply/sapply loop.

 M[] <- lapply(M, function(x) sapply(strsplit(as.character(x), ':'),'[',1))
 M
 #   V1  V2  V3  V4  V5
 #1 1/1 1/1 0/0 1/1 0/0
 #2 0/0 0/0 0/0 0/0 0/0
 #3 0/0 0/0 0/0 0/0 0/0
 #4 0/0 0/0 0/0 0/0 0/0
 #5 0/0 0/0 0/0 0/0 0/0
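For context, the call in the question fails on a data frame because `strsplit()` only accepts a character vector, while a data frame is a list of columns. Here is a minimal sketch of the per-column idea on a small made-up data frame. The names `df`, `first_field` and `out` are illustrative only, and `vapply()` is swapped in for `sapply()` so that the return type is checked.

```r
# Hypothetical two-column example; not the OP's data.
df <- data.frame(V1 = c("1/1:205,54,0:18", "0/0:0,78,255:26"),
                 V2 = c("1/1:141,39,0:13", "0/0:0,63,255:21"),
                 stringsAsFactors = FALSE)

# strsplit() on the whole data frame raises an error; capture it instead of stopping
res <- tryCatch(strsplit(df, ":"), error = function(e) "error")

# Split one column at a time and keep the first piece of each element
first_field <- function(x) {
  vapply(strsplit(as.character(x), ":", fixed = TRUE), `[`, character(1), 1)
}

out <- df
out[] <- lapply(df, first_field)  # out[] keeps the data.frame structure
out
```

Assigning into `out[]` rather than `out` is what keeps the result a data frame instead of a plain list.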

Or, as the columns all have the same format, we can unlist the data frame, call strsplit once, and assign the output back with M[] so that the original structure is kept intact.

  M[] <- sapply(strsplit(unlist(M), ':'),'[',1)
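A small sketch of what happens in that one-liner, again on made-up data (`df`, `flat` and `firsts` are illustrative names): the intermediate result is a plain character vector, and it is the `df[] <-` assignment that puts the values back into the existing columns.

```r
# Hypothetical example data; not the OP's data.
df <- data.frame(V1 = c("1/1:205", "0/0:0"),
                 V2 = c("1/1:141", "0/0:63"),
                 stringsAsFactors = FALSE)

flat <- unlist(df)                             # character vector, filled column by column
firsts <- sapply(strsplit(flat, ":"), "[", 1)  # first piece of each string
is.data.frame(firsts)                          # FALSE: this is still just a vector

df[] <- firsts                                 # refill the existing columns in order
is.data.frame(df)                              # TRUE: the structure is unchanged
```

This relies on the values coming back in the same column-by-column order that `unlist()` produced them in.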

Or a faster option would be to use stri_extract_first from stringi to extract the first run of characters that are not ":".

  library(stringi)
  M[] <- stri_extract_first(unlist(M), regex='[^:]+')

edited Jul 27 '15 at 19:03

answered Jul 27 '15 at 18:12

akrun


Option 1 seems to need a long time. Option 2 took less time, but the result is still a large character vector, not a data frame or matrix – user3354212 Jul 27 '15 at 18:31

@user3354212 The result will be a vector. But when you assign it to `M[] <-...` it will be a data.frame with the original structure intact – akrun Jul 27 '15 at 18:34
My comment about the data structure was wrong; I was looking at different data. I like option 3, which took only 8.14 seconds; option 2 took 67.73 sec. Answer 3 from Richard Scriven took 155.05 sec for my real data. Thanks. – user3354212 Jul 27 '15 at 18:48
@user3354212 Did you mean that option 1 took `155.05 sec`? That would make sense, as we are using a nested loop there. The `stringi` methods (option 3) should be very fast. – akrun Jul 27 '15 at 18:49
Option1 is lapply(M, function(x) sapply(strsplit(as.character(x), ':'),'[',1)) – user3354212 Jul 27 '15 at 18:51
@user3354212 Yes, I would expect that to be slower as it involves nested loops. – akrun Jul 27 '15 at 18:52
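For anyone who wants to compare the options on their own machine, here is a rough timing sketch on synthetic data. The data frame `big`, its size, and the trailing fields are all made up for illustration. The timings quoted in the comments above came from the OP's real data and will not be reproduced by this. The stringi call only runs if that package is installed.

```r
# Synthetic data: 5 columns of 20000 strings each (sizes chosen arbitrarily)
set.seed(1)
n <- 20000
big <- data.frame(matrix(paste0(sample(c("0/0", "1/1"), n * 5, replace = TRUE),
                                ":0,54,255:18:0:45"), ncol = 5),
                  stringsAsFactors = FALSE)

# Nested lapply/sapply with strsplit
t_nested <- system.time(
  r1 <- lapply(big, function(x) sapply(strsplit(x, ":"), "[", 1))
)

# sub() applied per column
t_sub <- system.time(
  r2 <- lapply(big, sub, pattern = ":.*", replacement = "")
)

# stringi, only if it is available
if (requireNamespace("stringi", quietly = TRUE)) {
  t_stri <- system.time(
    r3 <- stringi::stri_extract_first_regex(unlist(big), "[^:]+")
  )
}

t_nested
t_sub
```

Both base approaches should return identical values, so `identical(r1, r2)` is a quick sanity check before comparing the timings.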

Steven Beaupré · Answer 2 · 2015-07-27T18:20:48.823

4

Try:

library(dplyr)  # funs() must be in scope, so attach dplyr rather than only using dplyr::
mutate_each(M, funs(sub("(.*?)(:.*)", "\\1", .)))

Which gives:

#   V1  V2  V3  V4  V5
#1 1/1 1/1 0/0 1/1 0/0
#2 0/0 0/0 0/0 0/0 0/0
#3 0/0 0/0 0/0 0/0 0/0
#4 0/0 0/0 0/0 0/0 0/0
#5 0/0 0/0 0/0 0/0 0/0
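Note that `mutate_each()` and `funs()` have since been deprecated in dplyr. A possible present-day equivalent, assuming dplyr 1.0.0 or later is installed, uses `across()`. The data frame `df` below is a made-up stand-in for M.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, where across() is available

# Hypothetical example data; not the OP's data.
df <- data.frame(V1 = c("1/1:205,54,0:18", "0/0:0,78,255:26"),
                 V2 = c("1/1:141,39,0:13", "0/0:0,63,255:21"),
                 stringsAsFactors = FALSE)

# Apply the same sub() call to every column
out <- df %>%
  mutate(across(everything(), ~ sub("(.*?)(:.*)", "\\1", .x)))
out
```

`everything()` selects all columns; if only some columns hold these strings, a narrower selection could be used instead.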

edited Jul 27 '15 at 18:20

answered Jul 27 '15 at 18:06

Steven Beaupré


there is an error with your code, > dplyr::mutate_each(M, funs(sub("(.*?)(:.*)", "\\1" , .))) Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "c('matrix', 'character')" – user3354212 Jul 27 '15 at 18:41
What does `class(M)` return? From your question, I assumed it was a `data.frame` – Steven Beaupré Jul 27 '15 at 18:42

M is a large matrix. I tested M as a data frame and it works. It took 20.49 sec for my real data. Pretty good! – user3354212 Jul 27 '15 at 19:02
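Since the real M turned out to be a character matrix, here is a small sketch of two ways to handle that case on made-up data (`m`, `m_out` and `df_out` are illustrative names): work on the matrix directly, because `sub()` is vectorised, or convert to a data frame first.

```r
# Hypothetical 2 x 2 character matrix; not the OP's data.
m <- matrix(c("1/1:205,54,0", "0/0:0,78,255",
              "1/1:141,39,0", "0/0:0,63,255"), nrow = 2)

# Option A: stay with the matrix; m_out[] keeps the dimensions
m_out <- m
m_out[] <- sub(":.*", "", m)

# Option B: convert to a data frame, then process column by column
df_out <- as.data.frame(m, stringsAsFactors = FALSE)
df_out[] <- lapply(df_out, sub, pattern = ":.*", replacement = "")

m_out
df_out
```

Option A avoids the conversion entirely, which may matter for a large matrix, though I have not timed it on data of that size.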

Rich Scriven · Answer 3 · 2015-07-27T18:15:22.007

You can use sub():

M[] <- lapply(M, sub, pattern = ":.*", replacement = "")
M
#    V1  V2  V3  V4  V5
# 1 1/1 1/1 0/0 1/1 0/0
# 2 0/0 0/0 0/0 0/0 0/0
# 3 0/0 0/0 0/0 0/0 0/0
# 4 0/0 0/0 0/0 0/0 0/0
# 5 0/0 0/0 0/0 0/0 0/0

The above will overwrite the original M data. If you do not wish to overwrite M, either assign it to a new variable name first or wrap lapply() in as.data.frame():

as.data.frame(lapply(M, sub, pattern = ":.*", replacement = ""))
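To see what the pattern does, it can help to try it on a single string. In the sketch below, `x`, `a`, `b` and `d` are illustrative names. The pattern `":.*"` matches the first colon and everything after it, so replacing the match with `""` leaves only the leading part.

```r
x <- "1/1:205,54,0:18:0:57"

a <- sub(":.*", "", x)              # drop the first ":" and everything after it
b <- sub("^([^:]*).*$", "\\1", x)   # same result: capture the leading non-":" characters
d <- sub(":.*", "", "0/0")          # no ":" present, so the string is returned unchanged

c(a, b, d)
```

Strings without a colon pass through untouched, so the call is safe on columns that are already in the short form.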
