I believe this is a loop and gregexpr() issue. I'm trying to extract multi-line text from every standardized instance within a number of standardized .txt forms and collect it in a data frame where each instance is a separate row. So far, I can extract the string data (though the match captures slightly more than the gregexpr() pattern specifies), but I can only export it to a .txt file as a single lump of text.

  1. How can I create a data frame of the extracted txt-files' text where each instance of multi-line text has its own row? (Once the data is in a data.frame format, I know how to export as xlsx from there.)
  2. How can I extract only the data from the parameters I have set?

With help (particularly from Ben from the comments of this post), here is what I have so far:

# Txt Data Format
txt 1 <-
"A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz."
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.

txt 2 <-
"A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz."
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.


#################################
# Directory and Text Extraction #
#################################

dest <- "C:/~"
docs_text <- list.files(path = dest, pattern = "txt",  full.names = TRUE)

## Assumes that all the content I want to extract is between "A." and "C." in 
## the text while ignoring "C." and "D." content.

docs_list <- list.files(path = dest, pattern = "txt",  full.names = TRUE)
docs_doc <- lapply(docs_list, function(i) {
  j <- paste0(scan(i, what = character()), collapse = " ")
  regmatches(j, gregexpr("(?<=A. The First).*?(?=C. The Third)", j, perl=TRUE))
})

lapply(1:length(docs_doc),  function(i) write.table(docs_doc[i], file=paste(docs_list[i], " ", 
" ", sep="."), quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " " ))

Current output looks like this where all of the text is in one line and captures more than just between "A." and "C.":

Current Output

Desired output would look like this where the multi-line text between any instance of "A." and "C." is extracted and assigned one line each:

Desired Output

Any help you could provide would be tremendously helpful!

I'm ultimately trying to develop an NLP model that can extract standardized form data from hundreds of large PDFs for a year-over-year repository. If this post suggests I'm not approaching this problem efficiently or effectively, I'm open to direction.

Thanks in advance!

2 Answers

Regex to the rescue.

First, your sample data is malformed; here is a usable version.

txt1 <-
"A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz."
vec <- readLines(textConnection(txt1)) # 'textConnection' to read non-file

We first combine everything into one string, then search for (and split on) "A.":

paste("A.", Filter(nzchar, strsplit(paste(vec, collapse = ""), "\\bA\\. ")[[1]]))
# [1] "A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz. C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz. "
# [2] "A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz. C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz." 
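If the goal is only the text between "A. The First" and "C. The Third" (as the comment below clarifies), the same collapse-then-match idea can be sketched as follows. This is a minimal sketch, not a definitive solution: it assumes both markers appear literally as in the sample, the shortened `txt1` is a stand-in for the real data, and `one_string`, `hits`, and `df` are illustrative names.

```r
# Shortened stand-in for the sample data above (an assumption for brevity).
txt1 <- "A. The First:  aaa. B. The Second: bbb.
    ccc ddd.
 C. The Third:  eee. D. The Fourth: fff.
    ggg.
 A. The First:  hhh. B. The Second: iii.
    jjj kkk.
 C. The Third:  lll. D. The Fourth: mmm."
vec <- readLines(textConnection(txt1))

# Collapse to one string so that a single match can span several lines.
one_string <- paste(vec, collapse = " ")

# Lazy match from each "A. The First" up to (not including) the next
# "C. The Third". The dots are escaped so they match a literal ".".
pattern <- "A\\. The First.*?(?=C\\. The Third)"
hits <- regmatches(one_string, gregexpr(pattern, one_string, perl = TRUE))[[1]]

# One row per extracted instance, with runs of whitespace squeezed to one space.
df <- data.frame(text = trimws(gsub("\\s+", " ", hits)),
                 stringsAsFactors = FALSE)
nrow(df)
```

Because the match is lazy (`.*?`), each hit stops at the first following "C. The Third" rather than running on to the last one in the file.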
r2evans
  • I'm trying to grab text between two string sequences which I specify. I only want to extract text between "A. The First" and "C. The Third", however many times it may occur in my text file. – Bradley Thomas Anderson Mar 13 '20 at 20:05
  • Okay. For clarity, though, you need to provide clear expected output; I don't think the image is clear enough. – r2evans Mar 13 '20 at 20:17

Your question could be a little clearer as I'm not sure if lines starting with "C. The Third:" should be included or not. The solution below stops right before that line:

data

txt1 <-
  "A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz."
vec <- readLines(textConnection(txt1)) # 'textConnection' to read non-file

answer

First, I note the numbers of the lines that start with "A. The First" or "C. The Third". I allow some whitespace ("\\s*") between the start of the element (^) and the pattern.

As <- grep("^\\s*A\\. The First", vec) # "\\." matches a literal dot
Cs <- grep("^\\s*C\\. The Third", vec)
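
The lookup in the next step pairs the n-th "A" line with the n-th "C" line, so it may be worth confirming that the markers really alternate before relying on it. A small sketch, assuming every block starts on its own line as in the sample (the shortened `vec` is a stand-in, and `paired` is an illustrative name):

```r
# Shortened stand-in for 'vec' (assumption: markers start their lines).
vec <- c("A. The First:  aaa. B. The Second: bbb.",
         "    ccc ddd.",
         " C. The Third:  eee. D. The Fourth: fff.",
         "    ggg.",
         " A. The First:  hhh. B. The Second: iii.",
         "    jjj kkk.",
         " C. The Third:  lll. D. The Fourth: mmm.")

As <- grep("^\\s*A\\. The First", vec)
Cs <- grep("^\\s*C\\. The Third", vec)

# Every "A" line needs a later "C" line, and each "C" line must come
# before the next "A" line; otherwise the pairwise lookup is unreliable.
paired <- length(As) == length(Cs) &&
  all(As < Cs) &&
  all(Cs[-length(Cs)] < As[-1])
paired
```

If `paired` is FALSE for a file, that form probably deviates from the standard layout and deserves a manual look.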

Now I use these line numbers to look up the lines between them and collapse them into strings. Note that y - 1 removes the line starting with "C. The Third". If you want to keep that one as well, remove - 1:

df <- data.frame(
  text = mapply(function(x, y) paste(vec[x:(y - 1)], collapse = "\n"), As, Cs),
  stringsAsFactors = FALSE
)
df
#>                                                                                                                                                                                            text
#> 1  A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.\n    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
#> 2  A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.\n    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.

Created on 2020-03-17 by the reprex package (v0.3.0)
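
To cover several files at once, which is what the question ultimately asks for, the same steps could be wrapped in a function and applied to each path. This is a sketch under the assumptions above (markers at line starts, matched "A"/"C" pairs); `extract_blocks` and the `file` column are hypothetical names, and the temporary files only stand in for the real forms.

```r
# Hypothetical helper: one data-frame row per "A. The First" block in a file.
extract_blocks <- function(path) {
  vec <- readLines(path)
  As <- grep("^\\s*A\\. The First", vec)
  Cs <- grep("^\\s*C\\. The Third", vec)
  data.frame(
    file = rep(basename(path), length(As)),
    # Join the lines from each "A" line up to the line before its "C" line.
    text = vapply(seq_along(As), function(k) {
      paste(vec[As[k]:(Cs[k] - 1)], collapse = "\n")
    }, character(1)),
    stringsAsFactors = FALSE
  )
}

# Two temporary files stand in for the real .txt forms (assumption).
sample_lines <- c("A. The First:  aaa. B. The Second: bbb.",
                  "    ccc ddd.",
                  " C. The Third:  eee. D. The Fourth: fff.",
                  "    ggg.")
paths <- file.path(tempdir(), c("form1.txt", "form2.txt"))
for (p in paths) writeLines(c(sample_lines, sample_lines), p)

# Stack the per-file results into a single data frame.
all_blocks <- do.call(rbind, lapply(paths, extract_blocks))
nrow(all_blocks)
```

With real data, `paths` would instead come from something like `list.files(path = dest, pattern = "\\.txt$", full.names = TRUE)`, where `dest` is the directory from the question.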

JBGruber