Extracting paragraphs within section headings

Question

I have text (read in through readtext) that looks like this:

First Summary of Lorem Ipsum

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

Second Summary of Lorem Ipsum

It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

I would like to extract the two sections individually, without their section titles, and save them as two different character strings in R so I can convert them back to separate .txt files.

Any efforts so far? What are the rules for a string to be a valid title? — Code Maniac, Oct 23 '19 at 17:10
How do you identify a header vs the paragraph? Can multiple paragraphs follow a header? If it's constant, you could simply split your document on `(?:\r\n|[\r\n])[ \t]*(?:\r\n|[\r\n])` and extract every second result (positions 0,2,4,6,... in the array) — ctwheels, Oct 23 '19 at 17:10
This question has already been asked several times on SO. For example, [here](https://stackoverflow.com/q/51815205/3277821), [here](https://stackoverflow.com/q/39926993/3277821), and [here](https://stackoverflow.com/q/40479496/3277821). — sboysel, Oct 23 '19 at 17:14
I'm wanting to identify the header by the name of the header (since there are only a few). Multiple paragraphs can follow the header, but they have different lengths of paragraphs. — Mitch Pudil, Oct 23 '19 at 17:15
@sboysel those aren't the best examples for this user's question. — ctwheels, Oct 23 '19 at 17:15
@MitchPudil **how** do you identify the header? We don't have the same knowledge of your problem that you have, so it's hard to say what you need when you haven't identified the formats, required information for us to answer, and problem you're experiencing. — ctwheels, Oct 23 '19 at 17:15
The header appears as part of the string, just like the paragraphs themselves. The only difference is the actual title, which can be multiple words long. — Mitch Pudil, Oct 23 '19 at 17:18
@MitchPudil that doesn't help me identify a header though, there must be some sort of rules, or a list variable with all your headers in it, something for us to identify headers. Right now, the only way I can truly say identifies a header is the fact that it's the 0th and 2nd sentences in the text you posted, or that a paragraph ends with a `.` when a header does not. Regex is a set of rules, but we can't help you with it since only you know the format you need. We can't even begin to generate a *correct* regex pattern without the rules by which it must abide. — ctwheels, Oct 23 '19 at 17:21
I could have a list of variables with all my headers in it. Let's say I have `titles <- c("First Summary of Lorem Ipsum", "Second Summary of Lorem Ipsum")` — Mitch Pudil, Oct 23 '19 at 17:24
@MitchPudil Then in that case you might as well use one of the existing solutions, splitting the text into paragraphs and headers by line breaks, and filter the titles out of your result using the `titles` vector — sboysel, Oct 23 '19 at 17:34

ctwheels · Accepted Answer · 2019-10-23T17:47:50.093

You can split the string on blank lines using a regex (with strsplit), then use setdiff to drop the elements of the strsplit result that match one of the titles.


titles <- c("First Summary of Lorem Ipsum", "Second Summary of Lorem Ipsum")

s <- "First Summary of Lorem Ipsum

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

Second Summary of Lorem Ipsum

It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

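# Split on blank lines (two newlines, optionally padded with spaces or tabs)
a <- unlist(strsplit(s, "\\h*\\R\\h*\\R\\h*", perl = TRUE))
# Drop the elements that are section titles
setdiff(a, titles)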

The above results in:

[1] "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book."                                                                                   
[2] "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."
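
The comments mention that several paragraphs can follow one header, and setdiff returns a single flat vector (it also drops duplicated elements), so the link between a paragraph and its header is lost. A minimal sketch of one way to keep paragraphs grouped under their header and write each section to its own file follows. The short stand-in text, the section_N.txt file names, and the use of tempdir() are assumptions for illustration, not part of the original question.

```r
titles <- c("First Summary of Lorem Ipsum", "Second Summary of Lorem Ipsum")

# Short stand-in text (an assumption); the second section has two paragraphs.
s <- "First Summary of Lorem Ipsum\n\nParagraph one.\n\nSecond Summary of Lorem Ipsum\n\nParagraph two.\n\nParagraph three."

# Split on blank lines, as in the code above.
a <- unlist(strsplit(s, "\\h*\\R\\h*\\R\\h*", perl = TRUE))

is_title <- a %in% titles
# cumsum() increases by one at every title, so each chunk carries the index
# of the most recent title seen at or before it.
section_id <- cumsum(is_title)

# Keep only body paragraphs (ignoring anything before the first title)
# and group them by section.
keep <- !is_title & section_id > 0
sections <- split(a[keep], section_id[keep])
names(sections) <- a[is_title][as.integer(names(sections))]

# Write one file per section; file names and directory are hypothetical.
paths <- file.path(tempdir(), paste0("section_", seq_along(sections), ".txt"))
for (i in seq_along(sections)) {
  writeLines(paste(sections[[i]], collapse = "\n\n"), paths[i])
}
```

Here sections is a named list with one character vector per header, so sections[["Second Summary of Lorem Ipsum"]] holds both paragraphs of the second section.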

An explanation of the regex above, \\h*\\R\\h*\\R\\h*, follows. I removed the double backslashes below for simplicity's sake (the doubling is only string escaping in R):

\h Matches horizontal whitespace
* Quantifies the previous token (in above regex \h) to match it zero or more times
\R Matches any Unicode newline sequence (for example \r\n, \r, or \n)

The regex matches two newlines, with any amount of horizontal whitespace between or around them, in case the input contains something like \r\n\t\r\n.
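
As a small illustration of that behavior, the sketch below runs the same pattern on a made-up string (an assumption, not the asker's data) that mixes \r\n and \n line endings and has stray spaces or tabs on the otherwise blank lines.

```r
# Hypothetical input: Windows (\r\n) and Unix (\n) line endings mixed,
# with a tab or spaces on or next to the "blank" lines.
x <- "Header A\r\n\t\r\nBody A.  \n\nHeader B\n \nBody B."

# Same pattern as above: two newlines with optional horizontal whitespace.
parts <- unlist(strsplit(x, "\\h*\\R\\h*\\R\\h*", perl = TRUE))
print(parts)
```

Each header and body comes back as its own element, with the surrounding tabs and spaces consumed by the separator.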

The non-Perl equivalent of this would be:

[ \\t]*(?:\\r\\n|[\\r\\n])[ \\t]*(?:\\r\\n|[\\r\\n])[ \\t]*
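
Whether backslash escapes are honored inside a bracket expression can depend on the regex engine, so here is a minimal sketch of a variant that avoids the question: the R string literals use single backslashes, which lets R's parser put the actual tab, carriage-return, and newline characters into the pattern. The test string and variable names are assumptions for illustration.

```r
# Hypothetical input with mixed line endings and padded blank lines.
x <- "Header A\r\n\t\r\nBody A.  \n\nHeader B\n \nBody B."

# "\r", "\n" and "\t" are resolved by R's string parser here, so the regex
# engine only ever sees the literal control characters.
nl <- "(\r\n|\r|\n)"
pattern <- paste0("[ \t]*", nl, "[ \t]*", nl, "[ \t]*")

parts_default <- unlist(strsplit(x, pattern))  # default engine, no perl = TRUE
parts_perl <- unlist(strsplit(x, "\\h*\\R\\h*\\R\\h*", perl = TRUE))

# Both approaches should agree on this input.
print(identical(parts_default, parts_perl))
```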
