How can I split this

    Chr3:153922357-153944632(-)
    Chr11:70010183-70015411(-)

into

    Chr3  153922357 153944632 - 
    Chr11 70010183  70015411  -   

I tried `strsplit(df$V1, "[[:punct:]]")`, but the minus sign is missing from the result.

zx8754
Kryo
  • (I think) I accidentally deleted a valid post below (not mine)! My deepest apologies! I can't remember whose it was. I flagged this for the mods' attention to undelete. – Maurits Evers Dec 12 '17 at 13:33
  • You want to split only on the first occurrence of `-`. Perhaps some of the answers [here](https://stackoverflow.com/questions/26246095/r-strsplit-on-first-instance) may help – KenHBS Dec 12 '17 at 13:33

3 Answers

The issue is that `-` is both a character you want to extract and a delimiter. Your best bet is to use capture groups and spell out the full regex:

stringr::str_match(x, "^(.{4}):(\\d+)-(\\d+)\\((.)\\)$")

EDIT: To let the first capture group match chromosome names of any length (e.g. ChrN for any number N), change it from `.{4}` to `Chr\\d+`. If names such as ChrX or ChrM can occur, `Chr\\w+` covers those as well.
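
The same capture-group idea can also be sketched in base R with `regexec()` and `regmatches()`, in case loading stringr is not an option. The vector name `x` and the `Chr\\w+` and `[+-]` pattern pieces are assumptions for illustration:

```r
# Sample input (assumed to be a plain character vector)
x <- c("Chr3:153922357-153944632(-)",
       "Chr11:70010183-70015411(-)")

# One capture group per field: chromosome, start, end, strand
pattern <- "^(Chr\\w+):(\\d+)-(\\d+)\\(([+-])\\)$"

# regmatches() returns, per string, the full match followed by the groups
hits <- regmatches(x, regexec(pattern, x))

# Stack into a character matrix and drop the full-match column
m <- do.call(rbind, hits)[, -1, drop = FALSE]
colnames(m) <- c("chr", "start", "end", "strand")
m
```

Strings that do not match the pattern yield an empty element in `hits`, so it is worth checking `lengths(hits)` before binding.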

Alex Gold

How about this in base R, using strsplit and gsub:

# Your sample strings
ss <- c("Chr3:153922357-153944632(-)",
        "Chr11:70010183-70015411(-)")

# Split items as list of vectors 
lst <- lapply(ss, function(x)
    unlist(strsplit(gsub("(.+):(\\d+)-(\\d+)\\((.)\\)", "\\1,\\2,\\3,\\4", x), ",")))


# rbind into a character matrix (wrap in as.data.frame() if a data frame is needed)
do.call(rbind, lst)
#    [,1]    [,2]        [,3]        [,4]
#[1,] "Chr3"  "153922357" "153944632" "-"
#[2,] "Chr11" "70010183"  "70015411"  "-"

This should work for other chromosome names and positive strand features as well.
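
If typed columns are wanted rather than a character matrix, one option is `utils::strcapture()` (available since R 3.4.0), which applies the same kind of pattern and coerces each group to the type given in a prototype data frame. A minimal sketch, where the column names `chr`, `start`, `end` and `strand` are made up for illustration and a hypothetical `+` strand entry is added to the sample:

```r
ss <- c("Chr3:153922357-153944632(-)",
        "Chr11:70010183-70015411(-)",
        "ChrX:100-200(+)")   # hypothetical positive-strand entry

# Prototype: column names and types for the four captured fields
proto <- data.frame(chr = character(), start = integer(),
                    end = integer(), strand = character(),
                    stringsAsFactors = FALSE)

# Each capture group is converted to the type of the matching proto column
regions <- strcapture("^(.+):(\\d+)-(\\d+)\\((.)\\)$", ss, proto)
regions
```

Coordinates larger than about 2.1 billion do not fit in an R integer; in that case `numeric()` in the prototype is the safer choice.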

Maurits Evers

You can also try str_split from stringr:

library(stringr)
lapply(str_split(df$V1, "(?<!\\()\\-|[:\\)\\(]"), function(x) x[x != ""])

Result:

[[1]]
[1] "Chr3"      "153922357" "153944632" "-"        

[[2]]
[1] "Chr11"    "70010183" "70015411" "-"

Data:

df = read.table(text = " Chr3:153922357-153944632(-)
 Chr11:70010183-70015411(-) ")
acylam