I have utterances with annotation symbols:

utt <- c("↑hey girls↑ can I <join yo:u>", "((v: grunts))", "!damn shit! got it", 
"I mean /yeah we saw each other at a party:/↓ the other day"
)

I need to split utt into separate words unless the words are enclosed by certain delimiters, including this class [(/≈↑£<>°!]. I'm doing reasonably well using double negative lookahead for utts where only one such string between delimiters occurs; but I'm failing to split correctly where there are multiple such strings between delimiters:

library(tidyr)
library(dplyr)
data.frame(utt2 = utt) %>%
  separate_rows(utt2, sep = "(?!.*[(/≈↑£<>°!].*)\\s(?!.*[)/≈↑£<>°!])")
# A tibble: 9 × 1
  utt2                                        
  <chr>                                       
1 ↑hey girls↑ can I <join yo:u>               
2 ((v: grunts))                               
3 !damn shit!                                 
4 got                                         
5 it                                          
6 I mean /yeah we saw each other at a party:/↓
7 the                                         
8 other                                       
9 day 

The expected result would be:

1 ↑hey girls↑ 
2 can
3 I
4 <join yo:u>               
5 ((v: grunts))                               
6 !damn shit!                                 
7 got                                         
8 it                                          
9 I
10 mean 
11 /yeah we saw each other at a party:/↓
12 the                                         
13 other                                       
14 day 
Chris Ruehlemann

1 Answer

You can use

data.frame(utt2 = utt) %>% separate_rows(utt2, sep = "(?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F)|\\s+")

See the regex demo.

Note that in your case, there are chars that are paired (like ( and ), < and >) and non-paired chars (like ↑, £). They require different handling, which is reflected in the pattern.

Details:

  • (?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F) matches
    • ([/≈↓£°!↑]).*?\1| - a /, ≈, ↓, £, °, ! or ↑ char captured into Group 1, then any zero or more chars other than line break chars, as few as possible (see .*?), and then the same char as captured into Group 1, or
    • \([^()]*\)| - (, zero or more chars other than ( and ) and then a ) char, or
    • <[^<>]*> - <, zero or more chars other than < and > and then a > char
    • (*SKIP)(*F) - skip the matched text and restart a new search from the failure position
  • | - or
  • \s+ - one or more whitespaces in any other context.
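
To see the pattern at work without tidyr, here is a minimal base R sketch. It assumes a UTF-8 session and swaps in `strsplit()` with `perl = TRUE` (which selects the PCRE engine) for `separate_rows()`; the names `pat` and `tokens` are only illustrative.

```r
# Sample utterances from the question
utt <- c("↑hey girls↑ can I <join yo:u>", "((v: grunts))", "!damn shit! got it",
         "I mean /yeah we saw each other at a party:/↓ the other day")

# Same pattern as above; backslashes are doubled inside the R string literal
pat <- "(?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F)|\\s+"

# perl = TRUE is required: the (*SKIP)(*F) verbs are PCRE-specific
tokens <- unlist(strsplit(utt, pat, perl = TRUE))
print(tokens)
```

With the four sample utterances this should give the 14 items listed as the expected result in the question: the delimited stretches are consumed and discarded by the first alternative, so only whitespace outside them is left for `\\s+` to split on.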
Wiktor Stribiżew
  • Thanks a lot - great job (as always). I'm still not familiar enough with the SKIP and FAIL syntax. Is there a website with explanations that you'd recommend? – Chris Ruehlemann Dec 01 '21 at 10:17
  • @ChrisRuehlemann See [How do (*SKIP) or (*F) work on regex?](https://stackoverflow.com/q/24534782/3832970). If you still have doubts, you may drop a comment here, too. – Wiktor Stribiżew Dec 01 '21 at 10:23
  • I admit having read the linked post I'm still not fully clear about SKIP and FAIL. Would the above task also be feasible using lookaround? – Chris Ruehlemann Dec 01 '21 at 10:43
  • @ChrisRuehlemann That sounds like a new question :) Matching some pattern not in between two other patterns is not an easy regex task. In general, 1) if the two other patterns are identical single chars, a pattern like [this](https://stackoverflow.com/a/1757107/3832970) can be used (but it is very inefficient), 2) if the patterns are different, a variable width lookbehind is necessary, it will be, say for `<>`, `(?<!<[^<>]*)\s(?![^<>]*>)`. PCRE does not support this and ICU will require limiting quantifiers in the lookbehind with set min and max values. 3) If the patterns are different ... – Wiktor Stribiżew Dec 01 '21 at 10:53
  • @ChrisRuehlemann ...multicharacter strings, tempered greedy token will be necessary with the above regex. However, the pattern for 2) is not precise, it also avoids matching whitespace when it is just preceded with `<` and not followed with `>` and vice versa. `\s(?!(?<=<[^<>]*)[^<>]*>)` would be more precise, but it is so cryptic. – Wiktor Stribiżew Dec 01 '21 at 10:54
  • The regex fails for this string: `"↑!THAT'S! A WOMAN !THAT'S! A WOMAN↑"`, which is split into `↑!THAT'S!`, `A`, `WOMAN`, `!THAT'S!`, `A`, `WOMAN↑` while it should remain unsplit (as everything is between `↑` and `↑`). Any help with this? – Chris Ruehlemann Dec 01 '21 at 10:59
  • @ChrisRuehlemann Yes, I thought of that. You need to split the chars into paired and non-paired and use `(?:([\/≈↓£°!↑]).*?\1|\([^()]*\)|<[^<>]*>)(*SKIP)(*F)|\s+`, see [demo](https://regex101.com/r/lyvMpF/2). – Wiktor Stribiżew Dec 01 '21 at 11:02
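
The nested case from the last two comments can be checked with a short base R sketch. It assumes a UTF-8 session and uses `strsplit()` with `perl = TRUE` rather than `separate_rows()`; `nested`, `flat` and the `res_*` names are made up for illustration, and `flat` is a hypothetical contrast string that does not come from the thread.

```r
# Pattern from the comment above, written as an R string literal
pat <- "(?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F)|\\s+"

# String reported as problematic: everything sits between two ↑ chars
nested <- "↑!THAT'S! A WOMAN !THAT'S! A WOMAN↑"
res_nested <- strsplit(nested, pat, perl = TRUE)[[1]]

# Hypothetical contrast: the same !...! stretch with no outer ↑ pair
flat <- "A !THAT'S! WOMAN"
res_flat <- strsplit(flat, pat, perl = TRUE)[[1]]

print(res_nested)  # one element: the whole string stays together
print(res_flat)    # "A" "!THAT'S!" "WOMAN"
```

The outer `↑ ... ↑` stretch survives because the first delimiter met at a given position is matched lazily up to the next occurrence of the same char, and everything in between is skipped as one block. A string with interleaved delimiters (say `↑ ... ! ... ↑ ... !`) may therefore still behave differently from what an annotator intends.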