I have a column of strings that looks like this:

| Image       |
| ---         |
| CR 00_01_01 |
| SF 45_04_07 |
| etc.        |

I want to get an end result of this:

| Condition | Time |
| ---       | ---  |
| CR        | 00   |

I currently do this in two steps, which feels cumbersome: I split the string twice, first on the space and then on the underscore.

df <- df[, c("Condition", "T") := tstrsplit(Image, " ", fixed = TRUE)]
df <- df[, c("Time") := tstrsplit(T, "_", fixed = TRUE, keep = 1L)]

Is there any better way of doing this?

CeC
  • What format do you want your output in? Also, will the `Image` column always have that structure (i.e., two letters, a space, and three pairs of numbers separated by underscores)? – Andrew May 16 '19 at 16:29

2 Answers

Here is a strsplit solution that sounds like it is what you are looking for. Split based on space or underscore and select first two elements.

split_string <- strsplit(df1$Image, split = "\\s|_")

data.frame(Condition = sapply(split_string, `[`, 1),
           Time = sapply(split_string, `[`, 2))

  Condition Time
1        CR   00
2        SF   45
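
Since the question uses data.table, the same "split on space or underscore, keep the first two pieces" idea can probably be written as a single `tstrsplit()` call. This is only a sketch: `dt` is a hypothetical toy table standing in for the real data, and it assumes every `Image` value follows the format shown in the question.

```r
library(data.table)

# Toy data mirroring the question; the real table has more columns.
dt <- data.table(Image = c("CR 00_01_01", "SF 45_04_07"))

# Split on a space or an underscore and keep only the first two pieces.
# `:=` adds the columns by reference, so no reassignment is needed.
dt[, c("Condition", "Time") := tstrsplit(Image, "[ _]", keep = 1:2)]

print(dt)
```

Both new columns are character vectors; converting `Time` to a number would drop the leading zero.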

If the format of the Image column is always the same, you could extract based on position.

data.frame(Condition = substr(df1$Image, 1, 2),
           Time = substr(df1$Image, 4, 5))

  Condition Time
1        CR   00
2        SF   45

Or you could just use regex to extract the letters / first pair of numbers.

data.frame(Condition = gsub("^([[:alpha:]]+).*", "\\1", df1$Image),
           Time = gsub(".*[[:space:]]([[:digit:]]+)_.*", "\\1", df1$Image))

  Condition Time
1        CR   00
2        SF   45
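
As a variation on the regex idea, both pieces can be pulled out with one pattern and attached to the original data frame in the same step. The sketch below uses `strcapture()` from the utils package (available in R 3.4.0 and later); the column names in `proto` are chosen here to match the desired output and are not from the original answer.

```r
df1 <- data.frame(Image = c("CR 00_01_01", "SF 45_04_07"),
                  stringsAsFactors = FALSE)

# One pattern, two capture groups: leading letters, then the first digit run.
parts <- strcapture("^([[:alpha:]]+)[[:space:]]([[:digit:]]+)_",
                    df1$Image,
                    proto = data.frame(Condition = character(),
                                       Time = character(),
                                       stringsAsFactors = FALSE))

# Attach the new columns to the original data frame.
df1 <- cbind(df1, parts)
print(df1)
```

Rows that do not match the pattern come back as `NA`, which makes malformed `Image` values easy to spot.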

Data:

df1 <- data.frame(Image = c("CR 00_01_01", "SF 45_04_07"), stringsAsFactors = FALSE)
Andrew
  • Seems like they want to end up with a data frame (or data frames? It's unclear) – camille May 16 '19 at 16:25
  • I start with a big data.frame, and Image is one of its columns. I'd like the two resulting columns to be attached to the original data.frame, perhaps through cbind(). Thank you for your answer! The second extraction code was perfect for my code. – CeC May 16 '19 at 18:13
  • Glad it helped!! In the future, it is a good practice to post a reproducible example and your expected output. Good luck! – Andrew May 16 '19 at 18:25
  • What's the best way to post a reproducible example? Is there any way for me to post the csv/txt file? It's a huge dataset and I wasn't sure how to show the relevant part to my question. Again, thank you for your generous help. My code looks much less cluttered now! – CeC May 16 '19 at 18:34
  • @CeC, great question! There is a [really helpful post](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on SO that outlines how to create reproducible examples in R. There is a preference for an example that people can copy / paste into their console over a link to a file or a picture. E.g., if the data frame is large you could always do something like `dput(head(df))`. – Andrew May 16 '19 at 20:07
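
To illustrate the `dput(head(df))` suggestion from the comments, here is a minimal sketch using the two-row `df1` from this answer as a stand-in for a large data set. The names `txt` and `rebuilt` are introduced only for this illustration.

```r
df1 <- data.frame(Image = c("CR 00_01_01", "SF 45_04_07"),
                  stringsAsFactors = FALSE)

# Print a copy-pasteable definition of the first rows.
dput(head(df1))

# Round trip: evaluating the printed text rebuilds an equivalent object,
# which is what a reader does when pasting it into their console.
txt <- capture.output(dput(head(df1)))
rebuilt <- eval(parse(text = txt))
all.equal(rebuilt, head(df1))
```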

You can try this using dplyr and tidyr:

library(dplyr)
library(tidyr)
df %>% separate(image, c("Image", "Time"), " ") %>%
  mutate(Time = sub("([0-9]+).*", "\\1", Time))

  Image Time
1    CR   00
2    SF   45

Data

structure(list(image = c("CR 00_01_01", "SF 45_04_07")), class = "data.frame", row.names = c(NA, 
-2L))
boski
  • You can use just `df %>% separate(image, c("image", "time"))`. – tmfmnk May 16 '19 at 16:38
  • Thank you for both of your comments! Is there any website/instructive channel where I can go to understand more about the "\\" syntax? I tried to search through the answers, and grep and regex came up, but there weren't many comprehensive answers for someone who's not familiar with it. – CeC May 16 '19 at 18:17
  • @tmfmnk your code does not fully answer the question. @CeC what is going on in the `sub` is the following: grab the occurrence of digits `[0-9]`, as many as there are `+` and call that group number one `( )`. Then grab the rest `.*`. I then sub all of this for just the first group using `\\1`. – boski May 16 '19 at 21:49
  • Thank you! Is this how regex is organized? Also, could you explain what -2L means in row.names = c(NA, -2L)? – CeC May 17 '19 at 12:08
  • @CeC that is just the output of calling `dput()` on the data. If you were to type `df = structure(list( ...)` into your `R` console, you'd be working with the same data as me :) – boski May 17 '19 at 12:47
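
The two points raised in these comments can be checked at the console. The sketch below walks through the `sub()` pattern on the part of the string after the space, then rebuilds the answer's data to show what `row.names = c(NA, -2L)` amounts to. The vector `x` is made up for illustration.

```r
x <- c("00_01_01", "45_04_07")

# "([0-9]+)" captures the leading digits as group 1, ".*" consumes the rest,
# and "\\1" replaces the whole match with group 1 only.
sub("([0-9]+).*", "\\1", x)

# In an R string literal the backslash itself must be escaped, so the regex
# backreference \1 is typed as "\\1" (two characters once parsed).
nchar("\\1")

# dput() writes compact row names: c(NA, -2L) stands for rows 1 and 2.
df <- structure(list(image = c("CR 00_01_01", "SF 45_04_07")),
                class = "data.frame", row.names = c(NA, -2L))
rownames(df)
```

The base R help page `?regex` documents the character classes and backreferences used throughout this thread.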