2

I have a data frame M. I would like to extract the first part of each string, i.e. the text before the first ":". I used strsplit, but the result is a large character vector, not a data frame. Could someone please help with this?

M <- read.table(text=
"1/1:205,54,0:18:0:57 1/1:141,39,0:13:0:42   0/0:0,54,255:18:0:45 1/1:174,48,0:16:0:51 0/0:0,84,255:28:0:75 
 0/0:0,78,255:26:0:99 0/0:0,63,255:21:0:86   0/0:0,45,255:15:0:68 0/0:0,48,255:16:0:71 0/0:0,132,255:44:0:99
 0/0:0,78,255:26:0:89 0/0:0,78,255:26:0:89   0/0:0,36,255:12:0:47 0/0:0,33,255:11:0:44 0/0:0,108,255:36:0:99
 0/0:0,75,255:25:0:99 0/0:0,54,255:18:0:78   0/0:0,69,255:23:0:93 0/0:0,33,255:11:0:57 0/0:0,96,255:32:0:99 
 0/0:0,60,75:21:0:74  0/0:0,51,84:17:0:65    0/0:0,48,64:17:0:62  0/0:0,42,65:15:0:56  0/0:0,84,99:28:0:98 ",
head=F, stringsAsFactors=F)
S <- sapply(strsplit(M, ":"), "[", 1)
MichaelChirico
user3354212

3 Answers

5

strsplit may not be the best choice here, since we only need a substring. Assuming the OP wants to understand how strsplit can be used on this example dataset, one modification of the OP's code is a nested lapply/sapply loop.

 M[] <- lapply(M, function(x) sapply(strsplit(as.character(x), ':'),'[',1))
 M
 #   V1  V2  V3  V4  V5
 #1 1/1 1/1 0/0 1/1 0/0
 #2 0/0 0/0 0/0 0/0 0/0
 #3 0/0 0/0 0/0 0/0 0/0
 #4 0/0 0/0 0/0 0/0 0/0
 #5 0/0 0/0 0/0 0/0 0/0

Or, as the columns all have the same format, we can unlist the data, apply strsplit once, and assign the result back into M[] so that the original data frame structure is kept.

  M[] <- sapply(strsplit(unlist(M), ':'),'[',1)
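The `M[] <-` form is what keeps the data frame: assigning into `M[]` fills the existing cells, whereas a plain `<-` would leave only the character vector. A minimal sketch of the difference, assuming a small made-up data frame `df` (not the OP's real data):

```r
# Hypothetical two-column example; values mimic the question's format.
df <- data.frame(V1 = c("1/1:205,54,0:18", "0/0:0,78,255:26"),
                 V2 = c("0/0:0,54,255:18", "1/1:141,39,0:13"),
                 stringsAsFactors = FALSE)

# First field of every cell, as one character vector (column by column).
first <- sapply(strsplit(unlist(df), ":", fixed = TRUE), "[", 1)

plain <- first     # just a character vector: the data frame structure is gone
kept <- df
kept[] <- first    # fills df's cells column by column; still a data frame

class(plain)  # "character"
class(kept)   # "data.frame"
```

This relies on the vector having exactly nrow * ncol elements, which holds here because every cell yields one first field.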

Or a faster option would be to use stri_extract_first from stringi to extract the first run of characters that are not ":".

  library(stringi)
  M[] <- stri_extract_first(unlist(M), regex='[^:]+')
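If stringi is not available, the same "first run of non-colon characters" idea can be sketched in base R with regexpr() and regmatches(). This is a swapped-in base-R analogue, not the stringi method above, and the input vector `x` is made up for illustration:

```r
# Hypothetical input vector in the question's format.
x <- c("1/1:205,54,0:18:0:57", "0/0:0,78,255:26:0:99")

# regexpr() locates the first run of non-":" characters in each string;
# regmatches() extracts the matched text.
first <- regmatches(x, regexpr("[^:]+", x))
first
# [1] "1/1" "0/0"
```

One caveat: regmatches() drops elements that have no match (for example empty strings), so the result can be shorter than the input, whereas stri_extract_first returns NA for those elements. That matters if the result is assigned back into `M[]`.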
akrun
  • Option 1 seems to take a long time. Option 2 was quicker, but the result's structure is still a large character vector, not a data frame or matrix – user3354212 Jul 27 '15 at 18:31
  • @user3354212 The result will be a vector. But when you assign it with `M[] <-...`, M remains a data.frame with its original structure intact – akrun Jul 27 '15 at 18:34
  • My comment about the data structure was wrong; I was looking at a different dataset. I like option 3, which took only 8.14 seconds; option 2 took 67.73 sec. Answer 3 from Richard Scriven took 155.05 sec on my real data. Thanks. – user3354212 Jul 27 '15 at 18:48
  • @user3354212 Did you mean that option 1 took `155.05 sec`? That would make sense, as we are using a nested loop there. `stringi` methods should be very fast (option 3) – akrun Jul 27 '15 at 18:49
  • Option 1 is `lapply(M, function(x) sapply(strsplit(as.character(x), ':'),'[',1))` – user3354212 Jul 27 '15 at 18:51
  • @user3354212 Yes, I would expect that to be slower as it involves nested loops. – akrun Jul 27 '15 at 18:52
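The timings quoted in these comments come from the OP's real data, which is not shown. A rough sketch of how such a comparison could be reproduced with system.time(), assuming a synthetic data frame `big` whose size and cell contents are invented for illustration:

```r
set.seed(1)
# Synthetic stand-in for the real data: 2000 x 5 cells like "0/0:0,54,255:18".
n <- 2000
cells <- paste0(sample(c("0/0", "0/1", "1/1"), n * 5, replace = TRUE),
                ":0,", sample(30:99, n * 5, replace = TRUE), ",255:18")
big <- as.data.frame(matrix(cells, nrow = n), stringsAsFactors = FALSE)

# Option 1 from the answer: strsplit inside a nested lapply/sapply.
t_split <- system.time(
  r_split <- lapply(big, function(x) sapply(strsplit(x, ":"), "[", 1))
)
# For comparison: delete everything from the first ":" onward with sub().
t_sub <- system.time(
  r_sub <- lapply(big, sub, pattern = ":.*", replacement = "")
)

identical(r_split, r_sub)  # checks that both approaches give the same result
rbind(t_split, t_sub)      # timings vary by machine and data size
```

No expected timings are given here because they depend on the machine and on the data; the sketch only shows the measurement pattern.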
4

Try:

library(dplyr)
mutate_each(M, funs(sub("(.*?)(:.*)", "\\1" , .)))

Which gives:

#   V1  V2  V3  V4  V5
#1 1/1 1/1 0/0 1/1 0/0
#2 0/0 0/0 0/0 0/0 0/0
#3 0/0 0/0 0/0 0/0 0/0
#4 0/0 0/0 0/0 0/0 0/0
#5 0/0 0/0 0/0 0/0 0/0
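mutate_each() and funs() were deprecated in later dplyr releases. A sketch of what the same transformation might look like with across(), assuming dplyr 1.0.0 or newer and a small made-up data frame `df` in place of M:

```r
library(dplyr)

# Hypothetical two-column example; values mimic the question's format.
df <- data.frame(V1 = c("1/1:205,54,0:18", "0/0:0,78,255:26"),
                 V2 = c("0/0:0,54,255:18", "1/1:141,39,0:13"),
                 stringsAsFactors = FALSE)

# Apply sub() to every column, removing everything from the first ":" onward.
out <- mutate(df, across(everything(), ~ sub(":.*", "", .x)))
out
#    V1  V2
# 1 1/1 0/0
# 2 0/0 1/1
```

The pattern ":.*" is a simpler equivalent of the capture-group version above: both keep only the text before the first colon.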
Steven Beaupré
  • There is an error with your code: `dplyr::mutate_each(M, funs(sub("(.*?)(:.*)", "\\1" , .)))` gives `Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "c('matrix', 'character')"` – user3354212 Jul 27 '15 at 18:41
  • What does `class(M)` return? From your question, I assumed it was a `data.frame` – Steven Beaupré Jul 27 '15 at 18:42
  • M is a large matrix. I tested with M as a data frame and it works. It took 20.49 sec on my real data. Pretty good! – user3354212 Jul 27 '15 at 19:02
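Since the real M turned out to be a character matrix rather than a data frame, there are two ways to proceed. A minimal sketch, assuming a small made-up matrix `m`: either work on the matrix directly, or convert it to a data frame before calling functions that require one.

```r
# Hypothetical character matrix in the question's format.
m <- matrix(c("1/1:205,54,0:18", "0/0:0,78,255:26",
              "0/0:0,54,255:18", "1/1:141,39,0:13"), nrow = 2)

# Option A: stay with the matrix; assigning into m2[] keeps its dimensions.
m2 <- m
m2[] <- sub(":.*", "", m)

# Option B: convert to a data frame first, for functions that require one.
df <- as.data.frame(m, stringsAsFactors = FALSE)
df[] <- lapply(df, sub, pattern = ":.*", replacement = "")

dim(m2)    # 2 2
class(df)  # "data.frame"
```

Option A avoids the conversion entirely, because sub() is vectorized over all cells of the matrix.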
4

You can use sub() to remove everything from the first ":" onward:

M[] <- lapply(M, sub, pattern = ":.*", replacement = "")
M
#    V1  V2  V3  V4  V5
# 1 1/1 1/1 0/0 1/1 0/0
# 2 0/0 0/0 0/0 0/0 0/0
# 3 0/0 0/0 0/0 0/0 0/0
# 4 0/0 0/0 0/0 0/0 0/0
# 5 0/0 0/0 0/0 0/0 0/0

The above will overwrite the original M data. If you do not wish to overwrite M, assign it to a new variable name first, or just wrap lapply() in as.data.frame():

as.data.frame(lapply(M, sub, pattern = ":.*", replacement = ""))
Rich Scriven