I have a data frame with a single column, Value:

Value
message accepted
update: message received
new status: user online
no new messages

I want to split this column into two columns, "event" and "message". Not all rows have an event, so in those rows the "event" column should be NA. The desired result is:

event        message
NA           message accepted
update       message received
new status   user online
NA           no new messages

How can I do that? I don't really know how to express conditions in regular expressions. I tried this, but it doesn't work:

df %>% 
  tidyr::extract(col = "Value",
                   into = c("event", "message"),
                   regex = "(?: (.*?):)? (?s:(.*))$", remove = FALSE)
french_fries

3 Answers

You may use

^(?:(.*?):)?\s*((?s:.*))$

See the regex demo. Details:

  • ^ - start of string
  • (?:(.*?):)? - an optional sequence of
    • (.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
    • : - a colon
  • \s* - 0+ whitespace chars
  • ((?s:.*)) - Group 2: any zero or more chars, including line break chars (because of the inline s modifier), as many as possible
  • $ - end of string.
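
To see what the two groups capture outside of tidyr, here is a minimal base R sketch that applies the same pattern with sub() and perl = TRUE. The vector x and the variable names are illustrative, and turning an empty Group 1 into NA is an assumption made to mirror the desired output (it would also turn an empty event, as in ":text", into NA).

```r
# Sample values taken from the question
x <- c("message accepted", "update: message received",
       "new status: user online", "no new messages")

pattern <- "^(?:(.*?):)?\\s*((?s:.*))$"

# "\\1" is Group 1; when the optional part does not match, it comes back as ""
event_raw <- sub(pattern, "\\1", x, perl = TRUE)

# Assumption: an empty Group 1 means "no event", so map it to NA
event <- ifelse(nzchar(event_raw), event_raw, NA_character_)

# "\\2" is Group 2: the text after the optional "event:" prefix and any whitespace
message <- sub(pattern, "\\2", x, perl = TRUE)

data.frame(Value = x, event = event, message = message)
```

Unlike tidyr::extract, sub() does not return NA for a group that did not take part in the match, which is why the empty string is converted explicitly here.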

R demo:

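library(tidyr)  # tidyr also re-exports the %>% pipe

# Group 1 (event) is optional, so rows without a colon get NA in "event".
# This pattern omits the (?s:...) modifier; the sample values contain no line breaks.
df %>% 
   tidyr::extract(col = "Value",
                    into = c("event", "message"),
                    regex = "^(?:(.*?):)?\\s*(.*)$", remove = FALSE)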

Output:

                     Value      event          message
1         message accepted       <NA> message accepted
2 update: message received     update message received
3  new status: user online new status      user online
4          no new messages       <NA>  no new messages
Wiktor Stribiżew

You can use tidyr::separate

tidyr::separate(df, Value, c("event", "message"), sep = ":", 
                 extra = "merge", fill = "left", remove = FALSE)

#                     Value      event           message
#1         message accepted       <NA>  message accepted
#2 update: message received     update  message received
#3  new status: user online new status       user online
#4          no new messages       <NA>   no new messages

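We use ":" as the separator. extra = "merge" keeps everything after the first colon together in "message", and fill = "left" puts the NA in "event" when a row has no colon. Note that the message keeps the space that followed the colon; since sep is interpreted as a regular expression, sep = ":\\s*" would remove it.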
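
As a rough illustration of what "merge the extra pieces" and "fill on the left" mean, the following base R sketch emulates that behaviour by hand. It is not how tidyr implements separate; split_first() is a hypothetical helper, and the value "error: code: 42" is an invented example added to show a row with two colons.

```r
# Sample values from the question plus one hypothetical value with two colons
x <- c("message accepted", "update: message received",
       "new status: user online", "error: code: 42")

# split_first() is a hypothetical helper, not part of tidyr
split_first <- function(s) {
  pos <- as.integer(regexpr(":", s, fixed = TRUE))  # first colon, or -1 if none
  if (pos == -1L) {
    return(c(NA_character_, s))   # like fill = "left": the missing piece goes first
  }
  c(substr(s, 1, pos - 1),        # text before the first colon
    # everything after the first colon stays together (like extra = "merge");
    # trimws() also drops the space after the colon, which sep = ":" would keep
    trimws(substr(s, pos + 1, nchar(s))))
}

res <- t(vapply(x, split_first, character(2), USE.NAMES = FALSE))
colnames(res) <- c("event", "message")
res
```

The last row ends up with "error" as the event and "code: 42" as the message, because only the first colon is used as a split point.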

data

df <- structure(list(Value = c("message accepted", "update: message received", 
"new status: user online", "no new messages")), 
class = "data.frame", row.names = c(NA, -4L))
Ronak Shah

Here is a base R option

dfout <- cbind(
  df,
  setNames(data.frame(do.call(rbind, lapply(strsplit(df$Value, ": "), function(x) {
    v <- `length<-`(x, 2)         # pad to length 2 with NA when there is no ": "
    c(v[is.na(v)], v[!is.na(v)])  # move the NA to the front so it lands in "event"
  }))), c("event", "message"))
)

which gives

> dfout
                     Value      event          message
1         message accepted       <NA> message accepted
2 update: message received     update message received
3  new status: user online new status      user online
4          no new messages       <NA>  no new messages
ThomasIsCoding