In R, extract text between headings using regular expressions

Question

I want to extract all the text between chapter headings, including the first/opening heading but excluding the closing heading. The headings are always uppercase, always preceded by a digit-period or digit-letter-period combination, and always followed by space/s. I want to keep the subheadings (i.e. "6.1", "7A.1") as part of the extracted string. Here's some example text:

example <- "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac. 6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'. 7A WARNING 7A.1 Do not forget to warn passengers."

# The output I want is:

"5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac."

"6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'."

"7A WARNING 7A.1 Do not forget to warn passengers."

Using the stringr package, and with help from this post, I got this far:

library(stringr)
str_extract_all(example, "(\\d+\\w?\\.?[:blank:]+[:upper:]+)(.*?)(?=\\d+\\w?\\.?[:blank:]+[:upper:]+)")

# Explanation of my regex code:
# (\\d+\\w?\\.?[:blank:]+[:upper:]+)
# \\d+   one or more digits
# \\w?   zero or one word character (letter, digit, or underscore)
# \\.?   zero or one period
# [:blank:]+   one or more spaces/tabs
# [:upper:]+   one or more capital letters

# (.*?)   lazy (non-greedy) capture of zero or more of any character

# (?=\\d+\\w?\\.?[:blank:]+[:upper:]+)
# ?=   followed by
# \\d+   one or more digits
# \\w?   zero or one word character (letter, digit, or underscore)
# \\.?   zero or one period
# [:blank:]+   one or more spaces/tabs
# [:upper:]+   one or more capital letters

This came pretty close to what I want, with only two things going wrong. The first is that "6.1" is split into "6." and "1". The second is that the text after the last chapter heading isn't captured in full, and looks like it might be getting split the same way "6.1" was:

[[1]]
[1] "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac. "
[2] "6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6."                                  
[3] "1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'. "                                              
[4] "7A WARNING 7A."

Where am I going wrong??
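
One way to see where the split comes from is to test the lookahead part of the pattern on its own. The sketch below anchors it with `^` so that it is checked at a single position, and writes `[ \\t]` and `[A-Z]` in place of `[:blank:]` and `[:upper:]`; treating those as equivalent is an assumption that holds for this ASCII example. The variable names are made up for illustration.

```r
library(stringr)

# The lookahead from the question, anchored so it is tested at the start of each string.
# [ \t] and [A-Z] stand in for [:blank:] and [:upper:] (assumed equivalent for ASCII text).
heading_lookahead <- "^\\d+\\w?\\.?[ \\t]+[A-Z]+"

# A real heading satisfies it, as intended.
is_heading <- str_detect("6. WHEELS Never land", heading_lookahead)

# "6.1 Lower" does not: after "6." the pattern needs a blank but finds "1".
at_six <- str_detect("6.1 Lower the wheel", heading_lookahead)

# "1 Lower" does: a digit, a blank, and a single capital letter are enough,
# so the lazy (.*?) stops right before the "1" and the chunk ends at "6.".
at_one <- str_detect("1 Lower the wheel", heading_lookahead)

print(c(is_heading, at_six, at_one))
```

The last chunk is cut for the same reason: no heading follows "7A WARNING", so the only place the lookahead succeeds is in front of "1 Do", and the text after that point never meets a closing lookahead.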

Wiktor Stribiżew · Accepted Answer · 2020-05-28T14:22:52.460

You may use

example <- "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac. 6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'. 7A WARNING 7A.1 Do not forget to warn passengers."

library(stringr)
str_split(example, "(?!^)(?<!\\d[.A-Z])(?<!\\d[A-Z]\\.)\\b(?=\\d+(?:[a-zA-Z]|\\.)\\s+\\p{Lu})")

Output:

[[1]]
[1] "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac. "                                       
[2] "6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'. "
[3] "7A WARNING 7A.1 Do not forget to warn passengers."

See the R demo and the regex demo.

Details

(?!^) - not at the start of the string
(?<!\d[.A-Z]) - not if preceded by a digit and then a dot or an uppercase letter
(?<!\d[A-Z]\.) - not if preceded by a digit, an uppercase letter, and a dot
\b - match a word boundary location that is...
(?=\d+(?:[a-zA-Z]|\.)\s+\p{Lu}) - followed with 1+ digits, and then either a letter or a dot, then 1+ whitespaces and an uppercase letter.
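
Because the split happens in front of each heading, every piece except the last keeps the space that preceded the next heading. A minimal sketch of wrapping the split in a small function that also trims the pieces is shown here; `split_chapters` and `heading_split` are hypothetical names, not part of stringr.

```r
library(stringr)

# Split pattern from the answer, unchanged.
heading_split <- "(?!^)(?<!\\d[.A-Z])(?<!\\d[A-Z]\\.)\\b(?=\\d+(?:[a-zA-Z]|\\.)\\s+\\p{Lu})"

# Hypothetical helper: split one string into chapters and trim each piece.
split_chapters <- function(text) {
  pieces <- str_split(text, heading_split)[[1]]
  str_trim(pieces)
}

example <- "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac. 6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'. 7A WARNING 7A.1 Do not forget to warn passengers."

chapters <- split_chapters(example)
print(chapters)
```

`str_trim()` only removes leading and trailing whitespace, so spacing inside each chapter is left as it was.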

Thank you, almost works! It's still tripping up on my actual data where there are instances of a number-space-uppercase letter, e.g. `6.1 Lower the wheel mechanism using the 2 Switches labelled 'wheel mechanism'.` gets outputted as `6.1 Lower the wheel mechanism using the " "2 Switches labelled 'wheel mechanism'. "` Is it something to do with `\\p{Lu}`? — mendy, May 28 '20 at 14:15
@mendy If there must be a dot or letter after the number use `(?!^)(?<!\d[.A-Z])(?<!\d[A-Z]\.)\b(?=\d+(?:[a-zA-Z]|\.)\s+\p{Lu})`, see [this regex demo](https://regex101.com/r/jXoSji/1) — Wiktor Stribiżew, May 28 '20 at 14:20
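
To check the case raised in the comment, the pattern can be run on a shortened text that contains a number, a space, and a capitalised word in the middle of a sentence. This is only a sketch: the sample text is cut down from the question's example, and the variable names are made up for illustration.

```r
library(stringr)

# Pattern from the comment above: the number must be followed by a letter or a dot.
heading_split <- "(?!^)(?<!\\d[.A-Z])(?<!\\d[A-Z]\\.)\\b(?=\\d+(?:[a-zA-Z]|\\.)\\s+\\p{Lu})"

# Shortened sample with "2 Switches" inside a sentence.
text <- "6. WHEELS Never land on tarmac. 6.1 Lower the wheel mechanism using the 2 Switches labelled 'wheel mechanism'. 7A WARNING 7A.1 Do not forget to warn passengers."

pieces <- str_split(text, heading_split)[[1]]
print(pieces)
```

"2 Switches" is not a split point because a space, not a letter or a dot, follows the "2". The same requirement means a heading written as a bare number, with neither a letter nor a period, would not start a new piece either.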

Chris Ruehlemann · Answer 2 · 2020-05-28T13:53:43.450

This works too:

str_extract_all(example, "\\d[.A-Z\\d\\s]+[A-Z]{2,}[\\s(.\\w]+")
[[1]]
[1] "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non"                                                    
[2] "6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled "
[3] "7A WARNING 7A.1 Do not forget to warn passengers."
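
Note that in the output above the first two chunks stop early, at the hyphen in "non-emergency" and at the apostrophe before 'wheel mechanism', because the final character class `[\\s(.\\w]` contains neither character. A possible variant that does not depend on listing the allowed characters is sketched below. It assumes a heading is a number, an optional capital letter, an optional period, whitespace, and then at least two capital letters; that definition and the variable names are assumptions for illustration.

```r
library(stringr)

example <- "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac. 6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'. 7A WARNING 7A.1 Do not forget to warn passengers."

# Assumed heading shape: number, optional capital, optional period, whitespace,
# then two or more capitals (so "6.1 Lower" and "2 Switches" do not qualify).
heading <- "\\b\\d+[A-Z]?\\.?\\s+\\p{Lu}{2,}"

# A heading, then as little text as possible up to the next heading or the end of the string.
chapter_pattern <- paste0(heading, ".*?(?=\\s*", heading, "|$)")

chapters <- str_extract_all(example, chapter_pattern)[[1]]
print(chapters)
```

The `\\s*` inside the lookahead leaves the whitespace before the next heading out of the match, and the `|$` alternative lets the last chapter run to the end of the string. A heading consisting of a single capital letter would not be recognised under this assumption.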
