R - Extract and Parse Each Instance of Multi-line Delimited Text by 2 Strings into Individual Rows (.txt to data.frame)

Question

I believe this is a loop and gregexpr() issue. I'm trying to extract multi-line text from each of several standardized instances within each of several standardized .txt forms into a data frame where each instance is a separate row. So far, I can extract the string data (though the match captures a little more than the pattern I pass to gregexpr() specifies), but I can only export it to .txt as a single lump of text.

How can I create a data frame of the extracted txt-files' text where each instance of multi-line text has its own row? (Once the data is in a data.frame format, I know how to export as xlsx from there.)
How can I extract only the data from the parameters I have set?

With help (particularly from Ben from the comments of this post), here is what I have so far:

# Txt Data Format
txt 1 <-
"A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz."
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.

txt 2 <-
"A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz."
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.


#################################
# Directory and Text Extraction #
#################################

dest <- "C:/~"
docs_text <- list.files(path = dest, pattern = "txt",  full.names = TRUE)

## Assumes that all the content I want to extract is between "A." and "C." in 
## the text while ignoring "C." and "D." content.

docs_list <- list.files(path = dest, pattern = "txt",  full.names = TRUE)
docs_doc <- lapply(docs_list, function(i) {
  j <- paste0(scan(i, what = character()), collapse = " ")
  regmatches(j, gregexpr("(?<=A. The First).*?(?=C. The Third)", j, perl=TRUE))
})

lapply(1:length(docs_doc),  function(i) write.table(docs_doc[i], file=paste(docs_list[i], " ", 
" ", sep="."), quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " " ))

Current output looks like this where all of the text is in one line and captures more than just between "A." and "C.":

Desired output would look like this where the multi-line text between any instance of "A." and "C." is extracted and assigned one line each:

Any help you could provide would be greatly appreciated!

I'm ultimately trying to develop an NLP model that can extract standardized form data from hundreds of large PDFs for a year-over-year repository. If this post suggests I'm not approaching the problem efficiently or effectively, I'm open to direction.

Thanks in advance!

r2evans · Answer 1 · 2020-03-17T15:00:28.020

Regex to the rescue.

First, your sample data is malformed (invalid variable names and a misplaced closing quote), so here's usable data.

txt1 <-
"A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz."
vec <- readLines(textConnection(txt1)) # 'textConnection' to read non-file

We first combine everything into one string, then search for (and split on) "A.":

paste("A.", Filter(nzchar, strsplit(paste(vec, collapse = ""), "\\bA\\. ")[[1]]))
# [1] "A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz. C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz. "
# [2] "A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz. C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz."
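
If only the text between "A. The First" and "C. The Third" is wanted, one possible variation is to keep a lookahead and escape the dots so they match a literal period. The following is a minimal sketch, not part of the original answer: the short sample input and the names `one`, `pat`, and `hits` are made up for illustration, and the whitespace clean-up step is an assumption about the desired output.

```r
# Illustrative input; in practice this would come from readLines() on a file.
vec <- c(
  "A. The First:  aaa. B. The Second: bbb.",
  "    ccc ddd.",
  " C. The Third:  eee. D. The Fourth: fff.",
  "    ggg.",
  " A. The First:  hhh. B. The Second: iii.",
  "    jjj.",
  " C. The Third:  kkk. D. The Fourth: lll."
)

# Collapse to one string so a match can span the original line breaks.
one <- paste(vec, collapse = " ")

# Lazy match from "A. The First" up to (not including) the next "C. The Third".
# Dots are escaped so they match a literal period; (?s) lets "." also match
# newlines in case the input still contains any.
pat <- "(?s)A\\. The First.*?(?=C\\. The Third)"
hits <- regmatches(one, gregexpr(pat, one, perl = TRUE))[[1]]

# One row per instance; squeeze runs of whitespace and trim the ends.
df <- data.frame(
  text = trimws(gsub("\\s+", " ", hits)),
  stringsAsFactors = FALSE
)
df
```

Each element of `hits` becomes its own row, so the data frame has as many rows as there are "A. ... C." instances in the input.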

I'm trying to grab text between two string sequences which I specify. I only want to extract text between "A. The First" and "C. The Third", however many times it may occur in my text file. — Bradley Thomas Anderson, Mar 13 '20 at 20:05
Okay. For clarity, though, you need to provide clear expected output, I don't think the image is clear enough. — r2evans, Mar 13 '20 at 20:17

score 0 · Answer 2 · answered Mar 17 '20 at 14:51

Your question could be a little clearer as I'm not sure if lines starting with "C. The Third:" should be included or not. The solution below stops right before that line:

data

txt1 <-
  "A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz."
vec <- readLines(textConnection(txt1)) # 'textConnection' to read non-file

answer

First I note the numbers of the lines that start with "A. The First" or "C. The Third". I allow some whitespace ("\\s*") between the start of the element (^) and the pattern.

# Dots are escaped so they match a literal "." rather than any character
As <- grep("^\\s*A\\. The First", vec)
Cs <- grep("^\\s*C\\. The Third", vec)

Now I use these line numbers to look up the lines between them and collapse them into strings. Note that y - 1 removes the line starting with "C. The Third". If you want to keep that one as well, remove - 1:

df <- data.frame(
  text = mapply(function(x, y) paste(vec[x:(y - 1)], collapse = "\n"), As, Cs),
  stringsAsFactors = FALSE
)
df
#>                                                                                                                                                                                            text
#> 1  A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.\n    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
#> 2  A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.\n    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.

Created on 2020-03-17 by the reprex package (v0.3.0)
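
To apply the same line-index approach to a folder of .txt files, the steps can be wrapped in a function and the per-file results bound together. This is a hedged sketch rather than part of the original answer: `extract_blocks` is a hypothetical helper name, the temporary directory and file contents are made up so the example runs on its own, and the `file` column is an added assumption about what would be useful downstream.

```r
# Hypothetical helper: returns one row per "A. ..." block found in one file.
extract_blocks <- function(path) {
  vec <- readLines(path, warn = FALSE)
  As <- grep("^\\s*A\\. The First", vec)
  Cs <- grep("^\\s*C\\. The Third", vec)
  # Guard: every start line needs a matching end line that comes after it.
  stopifnot(length(As) == length(Cs), all(As < Cs))
  data.frame(
    file = rep(basename(path), length(As)),
    text = vapply(
      seq_along(As),
      function(i) paste(vec[As[i]:(Cs[i] - 1)], collapse = "\n"),
      character(1)
    ),
    stringsAsFactors = FALSE
  )
}

# Made-up demo files in a temporary directory.
tmp_dir <- tempfile("forms")
dir.create(tmp_dir)
writeLines(
  c("A. The First: a1", "  more a1", " C. The Third: c1",
    " A. The First: a2", " C. The Third: c2"),
  file.path(tmp_dir, "form1.txt")
)
writeLines(
  c("A. The First: b1", " C. The Third: c3"),
  file.path(tmp_dir, "form2.txt")
)

# Process every .txt file and stack the per-file data frames.
files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)
all_blocks <- do.call(rbind, lapply(files, extract_blocks))
all_blocks
```

The `stopifnot()` guard is there because unequal numbers of start and end lines would otherwise pair the wrong lines without any warning.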
