Split string using regular expressions and store it into data frame

Question

I have a string like this:

Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m.

I need to "parse" this string using certain rules:

The first block is the Received date and time
Each block after the first one starts with Changed status: and ends with a date and time
There can be any number of Changed status: blocks (at least 1) and the status can vary

What I need to do is to:

Split the string and put each block into an array.
Example:
[Received @ 10/10/2014 02:29:55 a.m.], [Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.], [Changed status: 'processed' @ 10/10/2014 02:40:24 a.m.]
After each block is split, I need to split each entry in three fields

For the above example, what I need is something like this:

Received       | NULL       | 10/10/2014 02:29:55 am
Changed status | processing | 10/10/2014 02:40:20 am
Changed status | processed  | 10/10/2014 02:40:24 am

I think step two is quite easy (each block can be split using @ and : as separators), but step one is making me pull my hair out. Is there a way to do this kind of thing with Regular Expressions?

I've tried some approaches (like Received|Changed.*[ap].m.), but it doesn't work (the evaluation of the regular expression always returns the full string).

I want to do this in R:

Read the full data table (which has more fields, and the text above is the last one) into a data frame
"Parse" this string and store it into a second data frame

R has built-in support for regular expressions, so that's my first thought on approaching the solution.

Any help will be appreciated. Honestly, I'm lost here (but I'll keep on trying... I'll edit my post if I find steps that bring me closer to the solution)

@DaaaahWhoosh I plan to do this in R (already said that in the post... and in the tags). But I can use any other RegEx capable language (e.g. `awk`). I'd like to do this directly in R because if I do it with some other utility I'll have to import the result to R anyway — Barranka, Nov 20 '14 at 22:27
are the received and changed statuses tab delimited? and everything else is single space separated? — rawr, Nov 20 '14 at 22:35

Rich Scriven · Accepted Answer · 2014-11-20T23:02:03.107

Here's a possibility that you could put into a function. In the string you posted, the blocks seem to be separated by two spaces, which is nice. Basically, what I did was make every block split into the same number of fields (three), so the pieces can be bound together as rows.

x <- "Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m."

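# Remove quotes and periods, then split into blocks on the double space
s <- strsplit(gsub("['.]", "", x), "  ")[[1]]
# Insert a colon after "Received" so that block also splits into three fields
g <- grep("Received", s)
s[g] <- sub("(\\D) ", "\\1:  ", s[g])
# Split every block on " @ " or ": " and stack the pieces as matrix rows
do.call(rbind, strsplit(s, " @ |: "))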
#      [,1]             [,2]         [,3]                      
# [1,] "Received"       ""           "10/10/2014 02:29:55 am"
# [2,] "Changed status" "processing" "10/10/2014 02:40:20 am"
# [3,] "Changed status" "processed"  "10/10/2014 02:40:24 am"

I went without "NULL" because I presume you meant you wanted an empty character there. NULL would not show up in a data frame anyway.
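
Since the answer suggests putting this into a function, here is a minimal sketch of one that also returns a data frame, as the question title asks. It is a slightly simplified variant of the code above, not the answer's exact code. The function name parse_status_log and the column names event, status and timestamp are placeholders of mine, and the sketch assumes, like the code above, that blocks are separated by exactly two spaces.

```r
# Hypothetical helper; the function and column names are illustrative only.
parse_status_log <- function(x) {
  # Drop quotes and periods, then split into blocks on the double space
  s <- strsplit(gsub("['.]", "", x), "  ", fixed = TRUE)[[1]]
  # Give the "Received" block an empty status so every block has three fields
  s <- sub("^Received", "Received: ", s)
  # Split each block on " @ " or ": " and bind the pieces row-wise
  m <- do.call(rbind, strsplit(s, " @ |: "))
  data.frame(event = m[, 1], status = m[, 2], timestamp = m[, 3],
             stringsAsFactors = FALSE)
}

x <- "Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m."
res <- parse_status_log(x)
```

The timestamps stay as character strings here; converting them to date-times would be a separate step.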

The trick is to get rid of all the non-important stuff before you split — Rich Scriven, Nov 20 '14 at 22:49

G. Grothendieck · Answer 2 · 2014-11-21T05:13:06.913

Here is a short solution based on strapplyc from the gsubfn package. strapplyc matches the regular expression against the input string s and extracts the portions matched by the parenthesized (capturing) groups; the (?:...) group is non-capturing, so it is not extracted on its own.

There are 3 capturing pairs of parentheses in pat. The first one matches Received or Changed status. Then we optionally match a colon, space, single quote, zero or more non-single-quote characters and another quote. The portion between the two quotes is the second captured string. Then we match space, @, space and the date/time string. The date/time string is captured.

Finally matrix is used to reshape it into 3 columns:

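library(gsubfn)

# s is the full input string from the question
s <- "Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m."

pat <- "(Received|Changed status)(?:: '([^']*)')? @ (../../.... ..:..:.. ....)"
matrix(strapplyc(s, pat, simplify = TRUE), ncol = 3, byrow = TRUE)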

giving:

     [,1]             [,2]         [,3]                      
[1,] "Received"       ""           "10/10/2014 02:29:55 a.m."
[2,] "Changed status" "processing" "10/10/2014 02:40:20 a.m."
[3,] "Changed status" "processed"  "10/10/2014 02:40:24 a.m."

Update: Simplification. Also modified output to be as in question.
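
For comparison, roughly the same extraction can be sketched in base R with gregexpr(), regexec() and regmatches(). It also illustrates the likely reason the attempt in the question grabbed too much: a greedy .* runs on to the last a.m. in the string, whereas the lazy .*? stops at the first one. The variable names are illustrative, and passing perl = TRUE to regexec() assumes a reasonably recent version of R.

```r
x <- "Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m."

# Step 1: pull out each block. ".*?" is lazy, so every match ends at the
# first "a.m." or "p.m." that follows instead of the last one in the string.
block_pat <- "(Received|Changed status).*?[ap]\\.m\\."
blocks <- regmatches(x, gregexpr(block_pat, x, perl = TRUE))[[1]]

# Step 2: capture the three fields of each block. The status part is
# optional, so the "Received" block yields an empty string there.
field_pat <- "^(Received|Changed status)(?:: '([^']*)')? @ (.*)$"
fields <- regmatches(blocks, regexec(field_pat, blocks, perl = TRUE))

# Each element is c(full match, event, status, timestamp); drop the full match
parsed <- do.call(rbind, lapply(fields, function(f) f[-1]))
```

Here parsed is a 3-column character matrix with one row per block, comparable to the strapplyc result.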

score 2 · Answer 3 · answered Nov 20 '14 at 22:43

tmp <- "Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m."


tmp1 <- strsplit(gsub('Received', 'Received:', tmp), '\\s{2}', perl = TRUE)

do.call(rbind, strsplit(tmp1[[1]], '@ |: '))

#      [,1]             [,2]            [,3]                      
# [1,] "Received"       ""              "10/10/2014 02:29:55 a.m."
# [2,] "Changed status" "'processing' " "10/10/2014 02:40:20 a.m."
# [3,] "Changed status" "'processed' "  "10/10/2014 02:40:24 a.m."
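
The status column above still carries the single quotes and a trailing space. A small follow-up sketch, assuming the same tmp string, strips them. The name m is a placeholder, and trimws() needs R 3.2.0 or later.

```r
tmp <- "Received @ 10/10/2014 02:29:55 a.m.  Changed status: 'processing' @ 10/10/2014 02:40:20 a.m.  Changed status: 'processed' @ 10/10/2014 02:40:24 a.m."

tmp1 <- strsplit(gsub('Received', 'Received:', tmp), '\\s{2}', perl = TRUE)
m <- do.call(rbind, strsplit(tmp1[[1]], '@ |: '))

# Remove the single quotes, then the leftover whitespace, from the status column
m[, 2] <- trimws(gsub("'", "", m[, 2], fixed = TRUE))
```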

score 1 · Answer 4 · answered Nov 20 '14 at 22:43

I'm assuming that you've got your data in a data.frame and that you want to do this on many rows in your data frame. I'm calling that data.frame "Data", and here's what I would do, although perhaps someone else could make this more elegant:

 library(stringr)   # str_split(), str_sub()
 library(lubridate) # mdy_hms()

 Split <- str_split(Data$String, "@") # Make a list with your string split by "@"

 Data$Received <- NA
 Data$Processing <- NA
 Data$Processed <- NA

 for (i in seq_len(nrow(Data))) {
       Data$Received[i] <- str_sub(Split[[i]][2], 2, 24) # Extract the date received, etc.
       Data$Processing[i] <- str_sub(Split[[i]][3], 2, 24)
       Data$Processed[i] <- str_sub(Split[[i]][4], 2, 24)
 }

 Data$Received <- mdy_hms(Data$Received) # Use lubridate to convert it to POSIX format
 Data$Processing <- mdy_hms(Data$Processing)
 Data$Processed <- mdy_hms(Data$Processed)

That gives you three columns for the dates and times you want.
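
If you would rather not depend on lubridate, the date-time conversion can also be sketched in base R. This is only an illustration: to_posix is a made-up helper name, it assumes the timestamps always look like mm/dd/yyyy hh:mm:ss followed by a.m. or p.m., and it treats them as UTC. The a.m./p.m. suffix is handled by hand so the result does not depend on the locale. The second and third example values are invented to exercise the p.m. and midnight cases.

```r
# Hypothetical helper: convert "mm/dd/yyyy hh:mm:ss a.m." strings to POSIXct (UTC).
to_posix <- function(ts) {
  is_pm <- grepl("p\\.?m\\.?$", ts, ignore.case = TRUE)
  # Drop the a.m./p.m. suffix and parse the rest as if it were a 24-hour time
  core <- sub("\\s*[ap]\\.?m\\.?$", "", ts, ignore.case = TRUE)
  t <- as.POSIXct(core, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
  hour <- as.integer(format(t, "%H", tz = "UTC"))
  # 12 a.m. becomes 00h and 12 p.m. stays 12h; other p.m. hours move forward by 12
  shift <- ifelse(hour == 12, ifelse(is_pm, 0, -12), ifelse(is_pm, 12, 0))
  t + shift * 3600
}

ts <- c("10/10/2014 02:29:55 a.m.",  # from the question
        "10/10/2014 02:40:20 p.m.",  # invented p.m. example
        "10/10/2014 12:05:00 a.m.")  # invented midnight example
out <- format(to_posix(ts), "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

The same helper can be applied to the third column produced by any of the other answers.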
