4

Based on this question: Split text string in a data.table columns, I was wondering whether there is an efficient method to split the text string conditionally based on the contents of the row.

Suppose I have the following table:

Comments                  Eaten
001 Testing my computer   No
0026 Testing my fridge    No
Testing my car            Yes

and I would to have this:

ID   Comments             Eaten
001  Testing my computer  No
0026 Testing my fridge    No
NA   Testing my car       Yes

Where NA is empty.

Is this possible in data.table?

The comment should have an ID, but since this is optional, I only want to extract the ID's if and only if the comment starts with a number.

Community
  • 1
  • 1
Snowflake
  • 2,869
  • 3
  • 22
  • 44

1 Answers1

7

This could be done using tidyr's extract function which allows you to specify a regex pattern:

tidyr::extract(dt, Comments, c("ID", "Comments"), regex = "^(\\d+)?\\s?(.*)$")
#     ID            Comments Eaten
#1:  001 Testing my computer    No
#2: 0026   Testing my fridge    No
#3:   NA      Testing my car   Yes

You can add the argument convert = TRUE if you want the extracted columns to be converted to a more sensible type.


Another option using only base R and data.table would be

dt[grepl("^\\d+", Comments),                     # check if start with ID (subset)
   `:=`(ID = sub("^(\\d+).*", "\\1",Comments),   # extract ID from comments
        Comments = sub("^(\\d+)", "",Comments))  # delete ID from Comments
]

Though in this case the tidyr syntax seems a little easier to me. There may also be a way using data.table's tstrsplit function with a fancy lookaround regex.

talat
  • 68,970
  • 21
  • 126
  • 157
  • `transpose(regmatches(x, regexec("^(\\d+)? ?(.*)", x)))` or similar, I guess. Not tested since OP's data is not copy-pastable... – Frank May 03 '17 at 13:38
  • R 3.4.0, also, has a `strcapture` function that could fit here -- `strcapture("^(\\d+)?\\s?(.*)$", dt$Comments, data.frame(ID = "", Comment = ""))` – alexis_laz May 03 '17 at 14:05
  • @alexis_laz looks interesting but also a bit strange with those empty strings. I haven't upgraded to 3.4.0 yet – talat May 03 '17 at 14:14