I am trying to split each string into three parts: the name, the time (day and clock time), and the remaining generic text. The data originally looks like this:

data = 
c("JENNIFER [Day 1, 9:00 A.M.]: Generic text, it doesn't matter what is going on here. There are more than 2 lines.",
"SAM [Day 2, 10:15 A.M.]: This doesn't matter. It has a lot of lines.",
"DAN'S [Day 4, 12:00 P.M.]: It doesn't really matter what's going on in this part.")

I was able to extract the first portion of each string, NAME [TIME]:, but I am having a hard time separating NAME from TIME.

match = regexpr("^[A-Z].*:", data)
regmatches(data, match)

This gives me:

JENNIFER [Day 1, 9:00 A.M.]:
SAM [Day 2, 10:15 A.M.]:
DAN'S [Day 4, 12:00 P.M.]:

I can see that the names are all in capital letters, so I would use "^[A-Z]", but this would also pick up every other sentence beginning with a capital letter.

I would like to end up with a data frame like this:

   Name           Date             Content
JENNIFER     Day 1 9:00A.M    "combined text" 

  • Please make this question *reproducible*. This includes sample data in an unambiguous format such as `dput(head(x))`. Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. (The suggested format with "blablablabla..." doesn't give enough to work with. Are there quotes around some of those literals?) – r2evans Apr 07 '19 at 23:19
  • Thanks, givemecoffee, I understand better the format of your data with your last edit. That kind of reproducible error-free data is very helpful to remove ambiguity and guessing on our part. – r2evans Apr 07 '19 at 23:35

1 Answer

After fixing up data so that it is valid R code (see the Note at the end), we can use strcapture from base R. Each parenthesized group in the pattern fills one column of the result:

strcapture("^(.*) \\[(.*)\\]: (.*)", data,
  list(Name = character(0), Date = character(0), Text = character(0)))

giving:

      Name              Date                                                  Text
1 JENNIFER  Day 1, 9:00 A.M. Blablablablablablbalbllalbalbalbl. Balalalbablablabl.
2      SAM Day 2, 10:15 A.M.  Balblablablabalbalbalblabalblablabl. Balaldfkemfeke.
3    DAN'S Day 4, 12:00 P.M.                                        DFnerke"dfsdf"
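The pattern above relies on greedy `.*`, which works for the sample data. If the free text could itself contain another " [" ... "]: " sequence, the greedy groups may attach part of the text to Name. A minimal sketch of one possible variation, using lazy quantifiers with `perl = TRUE`, is below. The vector `data2` and its second string are made up for illustration and are not part of the original data.

```r
# Hypothetical sample: the second string's text contains " [a]: ",
# which is NOT in the original data; it is invented to show the difference.
data2 <- c(
  "JENNIFER [Day 1, 9:00 A.M.]: Generic text.",
  "SAM [Day 2, 10:15 A.M.]: See note [a]: more text."
)

# Prototype describing the output columns and their types.
proto <- data.frame(Name = character(), Date = character(), Text = character(),
                    stringsAsFactors = FALSE)

# Greedy pattern from the answer, for comparison.
greedy <- strcapture("^(.*) \\[(.*)\\]: (.*)", data2, proto)

# Lazy pattern: (.*?) stops at the first " [" and the first "]: ".
lazy <- strcapture("^(.*?) \\[(.*?)\\]: (.*)$", data2, proto, perl = TRUE)

print(greedy$Name)
print(lazy$Name)
```

For data shaped exactly like the sample in the question, both patterns should give the same result, so this only matters if brackets followed by a colon can appear later in a string.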

Note

data <-
c('JENNIFER [Day 1, 9:00 A.M.]: Blablablablablablbalbllalbalbalbl. Balalalbablablabl.',
'SAM [Day 2, 10:15 A.M.]: Balblablablabalbalbalblabalblablabl. Balaldfkemfeke.',
'DAN\'S [Day 4, 12:00 P.M.]: DFnerke"dfsdf"')
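The question mentions that the time consists of a day and a clock time. If those are wanted as separate columns, one option is to apply strcapture a second time to the Date values. This is only a sketch: the column names `Day` and `Time` are my own choice, and it assumes every value has the form "Day N, time" as in the sample.

```r
# Date values as produced by the strcapture call in the answer.
dates <- c("Day 1, 9:00 A.M.", "Day 2, 10:15 A.M.", "Day 4, 12:00 P.M.")

# The integer() prototype makes strcapture convert the day number
# with as.integer(); Time stays a character column.
parts <- strcapture("^Day ([0-9]+), (.*)$", dates,
  data.frame(Day = integer(), Time = character(), stringsAsFactors = FALSE))

print(parts)
```

The result can be attached to the first data frame with cbind if a single table is preferred.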
G. Grothendieck