I believe this is a loop and gregexpr() issue. I'm trying to extract multi-line text from each of several standardized instances within each of several standardized .txt forms, and export it to a data frame where each instance is a separate row. So far, I can extract the string data (though the match captures a little more than the gregexpr() pattern specifies), but I can only export it to a .txt file as one lump of text.
- How can I create a data frame of the extracted txt-files' text where each instance of multi-line text has its own row? (Once the data is in a data.frame format, I know how to export as xlsx from there.)
- How can I extract only the data from the parameters I have set?
With help (particularly from Ben from the comments of this post), here is what I have so far:
# Txt Data Format
txt1 <-
"A. The First: abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
C. The Third: abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
A. The First: abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
C. The Third: abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz."
txt2 <-
"A. The First: abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
C. The Third: abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
A. The First: abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
C. The Third: abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz."
#################################
# Directory and Text Extraction #
#################################
dest <- "C:/~"
## Assumes that all the content I want to extract is between "A." and "C." in
## the text while ignoring "C." and "D." content.
docs_list <- list.files(path = dest, pattern = "txt", full.names = TRUE)
docs_doc <- lapply(docs_list, function(i) {
  # Read the file word by word and collapse it into one long string
  j <- paste0(scan(i, what = character()), collapse = " ")
  # Pull out every span between the two markers
  regmatches(j, gregexpr("(?<=A. The First).*?(?=C. The Third)", j, perl = TRUE))
})
# Write the matches for each file back out as text
lapply(seq_along(docs_doc), function(i) {
  write.table(docs_doc[i], file = paste(docs_list[i], " ", " ", sep = "."),
              quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " ")
})
Current output looks like this where all of the text is in one line and captures more than just between "A." and "C.":
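The single-line result is consistent with the `eol = " "` argument, which tells write.table() to end every row with a space instead of a line break. A minimal sketch comparing the two settings, using made-up matches and temporary files rather than my real data:

```r
hits <- c("first instance", "second instance")  # made-up matches

tmp_space <- tempfile(fileext = ".txt")
tmp_line  <- tempfile(fileext = ".txt")

# eol = " " joins all rows onto one physical line
write.table(hits, tmp_space, quote = FALSE, row.names = FALSE,
            col.names = FALSE, eol = " ")
# the default eol = "\n" writes one row per line
write.table(hits, tmp_line, quote = FALSE, row.names = FALSE,
            col.names = FALSE)

n_space <- length(readLines(tmp_space, warn = FALSE))  # 1
n_line  <- length(readLines(tmp_line))                 # 2
```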
Desired output would look like this where the multi-line text between any instance of "A." and "C." is extracted and assigned one line each:
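To make the target concrete, here is a minimal sketch of one way to get one row per instance. It assumes each file has already been read into a single string; `forms`, `extract_instances`, and the column names are hypothetical names chosen for illustration, and the sample text is made up.

```r
# In-memory stand-ins for two .txt forms (made-up content).
forms <- c(
  form1.txt = paste0("A. The First: aaa bbb. B. The Second: ccc.\nddd eee.\n",
                     "C. The Third: fff. D. The Fourth: ggg.\n",
                     "A. The First: hhh. B. The Second: iii.\nC. The Third: jjj."),
  form2.txt = "A. The First: kkk. B. The Second: lll.\nC. The Third: mmm."
)

# Same markers as my original pattern, with the dots escaped.
pattern <- "(?<=A\\. The First).*?(?=C\\. The Third)"

extract_instances <- function(text) {
  # Collapse line breaks so a multi-line instance becomes one string
  flat <- gsub("[[:space:]]+", " ", text)
  hits <- regmatches(flat, gregexpr(pattern, flat, perl = TRUE))[[1]]
  trimws(hits)
}

# One small data frame per file, then stack them
rows <- lapply(names(forms), function(nm) {
  hits <- extract_instances(forms[[nm]])
  data.frame(file = rep(nm, length(hits)),
             instance = seq_along(hits),
             text = hits,
             stringsAsFactors = FALSE)
})
result <- do.call(rbind, rows)
```

With real files, `forms` could presumably be built with something like `vapply(docs_list, function(f) paste(readLines(f, warn = FALSE), collapse = " "), character(1))`. Each extracted text still starts with the ": " that follows "A. The First", since the lookbehind only skips the marker itself.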
Any help you could provide would be tremendously helpful!
I'm ultimately trying to develop an NLP model that can extract standardized form data from hundreds of large PDFs for a year-over-year repository. If this post suggests I'm not approaching the problem efficiently or effectively, I'm open to direction.
Thanks in advance!