How to extract unique string in between string pattern in full text in R?

Question

I'm looking to extract names and professions of those who testified in front of Congress from the following text:

text <- c(("FULL COMMITTEE HEARINGS\\", \\" 2017\\",\n\\" April 6, 2017—‘‘The 2017 Tax Filing Season: Internal Revenue\\", \", \"\\"\nService Operations and the Taxpayer Experience.’’ This hearing\\", \\" examined\nissues related to the 2017 tax filing season, including\\", \\" IRS performance,\ncustomer service challenges, and information\\", \\" technology. Testimony was\nheard from the Honorable John\\", \\" Koskinen, Commissioner, Internal Revenue\nService, Washington,\\", \", \"\\" DC.\\", \\" May 25, 2017—‘‘Fiscal Year 2018 Budget\nProposals for the Depart-\\", \\" ment of Treasury and Tax Reform.’’ The hearing\ncovered the\\", \\" President’s 2018 Budget and touched on operations of the De-\n\\", \\" partment of Treasury and Tax Reform. Testimony was heard\\", \\" from the\nHonorable Steven Mnuchin, Secretary of the Treasury,\\", \", \"\\" United States\nDepartment of the Treasury, Washington, DC.\\", \\" July 18, 2017—‘‘Comprehensive\nTax Reform: Prospects and Chal-\\", \\" lenges.’’ The hearing covered issues\nsurrounding potential tax re-\\", \\" form plans including individual, business,\nand international pro-\\", \\" posals. Testimony was heard from the Honorable\nJonathan Talis-\\", \", \"\\" man, former Assistant Secretary for Tax Policy 2000–\n2001,\\", \\" United States Department of the Treasury, Washington, DC; the\\",\n\\" Honorable Pamela F. Olson, former Assistant Secretary for Tax\\", \\" Policy\n2002–2004, United States Department of the Treasury,\\", \\" Washington, DC; the\nHonorable Eric Solomon, former Assistant\\", \", \"\\" Secretary for Tax Policy\n2006–2009, United States Department of\\", \\" the Treasury, Washington, DC; and\nthe Honorable Mark J.\\", \\" Mazur, former Assistant Secretary for Tax Policy\n2012–2017,\\", \\" United States Department of the Treasury, Washington, DC.\\",\n\\" (5)\\", \\"VerDate Sep 11 2014 14:16 Mar 28, 2019 Jkt 000000 PO 00000 Frm 00013\nFmt 6601 Sfmt 6601 R:\\\\DOCS\\\\115ACT.000 TIM\\"\", \")\")" )

The full text is available here: https://www.congress.gov/116/crpt/srpt19/CRPT-116srpt19.pdf

The names seem to sit between "Testimony was heard from" and the next ".". How can I extract the text between these two patterns? The full text is much longer (a 50-page document), but I figure that if I can do it for one hearing, I can do it for the rest.

I know I can't use NLP for name extraction because the text also contains names of people who didn't testify.

The data you put in your question is very good, but very hard for us to put into our R session. Can you please run `dput(yourdata)` and paste the result into your question so others can easily use the data — stevec, Jan 22 '20 at 02:56
Also see https://stackoverflow.com/questions/39086400/extracting-a-string-between-other-two-strings-in-r — stevec, Jan 22 '20 at 02:59

score 3 · Accepted Answer · answered Jan 22 '20 at 08:58

NLP is likely unavoidable because of the many abbreviations in the text. Try this workflow:

1. Tokenize by sentence
2. Remove sentences without "Testimony"
3. Extract persons + professions from remaining sentences
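
To see why stopping at the next "." is not enough on its own, here is a minimal base R sketch on a made-up sentence modelled on the question's text. The sentence and the variable names `x` and `naive` are illustrative assumptions, not taken from the report. The lazy match ends at the period after the middle initial, which is the kind of abbreviation problem a sentence tokenizer is meant to handle:

```r
# Made-up sentence modelled on the hearing text in the question.
x <- paste(
  "Testimony was heard from the Honorable Pamela F. Olson, former Assistant",
  "Secretary for Tax Policy, Washington, DC; and the Honorable Mark J. Mazur,",
  "former Assistant Secretary for Tax Policy, Washington, DC."
)

# Naive approach: capture everything after the marker up to the first period.
pattern <- "(?<=Testimony was heard from ).*?(?=\\.)"
naive <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# The match is cut off at the middle initial "F." instead of the sentence end.
print(naive)
```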

There are a couple of packages with sentence tokenizers, but openNLP has generally worked best for me when dealing with abbreviation-laden sentences. The following code should get you close to your goal:

library(tidyverse)
library(pdftools)
library(openNLP)

# Get the data
testimony_url <- "https://www.congress.gov/116/crpt/srpt19/CRPT-116srpt19.pdf"
download.file(testimony_url, "testimony.pdf", mode = "wb")  # "wb" keeps the PDF intact on Windows
text_raw <- pdf_text("testimony.pdf")

# Clean the character vector and smoosh into one long string.
text_string <- str_squish(text_raw) %>% 
    str_replace_all("- ", "") %>% 
    paste(collapse = " ") %>% 
    NLP::as.String()

# Annotate and extract the sentences.
annotations <- NLP::annotate(text_string, Maxent_Sent_Token_Annotator())
sentences <- text_string[annotations]

# Some sentences starting with "Testimony" list multiple persons. We need to
# split these and clean up a little.
name_title_vec <- str_subset(sentences, "Testimony was") %>% 
    str_split(";") %>% 
    unlist %>% 
    str_trim %>% 
    str_remove("^(Testimony .*? from|and) ") %>% 
    str_subset("^\\(\\d\\)", negate = TRUE)

# Put in data frame and separate name from profession/title.
testimony_tibb <- tibble(name_title_vec) %>% 
    separate(name_title_vec, c("name", "title"), sep = ", ", extra = "merge")
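
If openNLP is not installed, the split-and-separate steps can still be tried on a single sentence. The following is a rough base R sketch of those steps only, not a replacement for the pipeline above. The sentence is made up, and `parts`, `first_comma`, and `witnesses` are hypothetical names:

```r
# One made-up "Testimony" sentence that lists two witnesses.
sentence <- paste(
  "Testimony was heard from the Honorable Pamela F. Olson, former Assistant",
  "Secretary for Tax Policy, Washington, DC; and the Honorable Mark J. Mazur,",
  "former Assistant Secretary for Tax Policy, Washington, DC."
)

# Split on ";" so each witness gets its own element, then trim whitespace.
parts <- trimws(unlist(strsplit(sentence, ";", fixed = TRUE)))

# Drop the leading "Testimony ... from " or "and ".
parts <- sub("^(Testimony .*? from|and) ", "", parts, perl = TRUE)

# Split at the first ", " only: the name is on the left, the rest is the title
# (this mirrors separate(..., sep = ", ", extra = "merge")).
first_comma <- as.integer(regexpr(", ", parts, fixed = TRUE))
witnesses <- data.frame(
  name  = substr(parts, 1, first_comma - 1),
  title = substr(parts, first_comma + 2, nchar(parts)),
  stringsAsFactors = FALSE
)
print(witnesses)
```

Splitting at the first comma assumes the name itself contains no comma, so entries such as "Jr." suffixes written after a comma would need separate handling.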

Running the full pipeline on the PDF should give you a data frame like the one below. Some additional cleaning may be necessary:

# A tibble: 95 x 2
   name                       title                                                               
   <chr>                      <chr>                                                               
 1 the Honorable John Koskin… Commissioner, Internal Revenue Service, Washington, DC.             
 2 the Honorable Steven Mnuc… Secretary of the Treasury, United States Department of the Treasury…
 3 the Honorable Jonathan Ta… former Assistant Secretary for Tax Policy 2000–2001, United States …
 4 the Honorable Pamela F. O… former Assistant Secretary for Tax Policy 2002–2004, United States …
 5 the Honorable Eric Solomon former Assistant Secretary for Tax Policy 2006–2009, United States …
 6 the Honorable Mark J. Maz… "former Assistant Secretary for Tax Policy 2012–2017, United States…
 7 Mr. Daniel Garcia-Diaz     Director, Financial Markets and Community Investment, United States…
 8 Mr. Grant S. Whitaker      president, National Council of State Housing Agencies, Washington, …
 9 the Honorable Katherine M… Ph.D., professor of public policy and planning, and faculty directo…
10 Mr. Kirk McClure           Ph.D., professor, Urban Planning Program, School of Public Policy a…
# … with 85 more rows
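
As one possible take on that additional cleaning, the sketch below strips leading honorifics, moves a leading "Ph.D." from the title to the name, and drops the sentence-final period. It uses base R on two made-up rows shaped like the output above. The rows, the honorific list, and the column handling are assumptions to adapt, not part of the original answer:

```r
# Two made-up rows shaped like the output above (not taken from the real result).
witnesses <- data.frame(
  name  = c("the Honorable Jane Doe", "Mr. John Roe"),
  title = c("Commissioner, Example Agency, Washington, DC.",
            "Ph.D., professor, Example University, Springfield, IL"),
  stringsAsFactors = FALSE
)

# Drop a leading honorific from the name (extend the list as needed).
witnesses$name <- sub("^(the Honorable|Mr\\.|Ms\\.|Mrs\\.|Dr\\.) ", "", witnesses$name)

# A leading "Ph.D., " in the title belongs to the person, so move it to the name.
has_phd <- grepl("^Ph\\.D\\., ", witnesses$title)
witnesses$name[has_phd]  <- paste0(witnesses$name[has_phd], ", Ph.D.")
witnesses$title[has_phd] <- sub("^Ph\\.D\\., ", "", witnesses$title[has_phd])

# Remove the sentence-final period left over from sentence tokenization.
witnesses$title <- sub("\\.$", "", witnesses$title)
print(witnesses)
```

Removing a trailing period is a blunt rule: it would also strip the period from a title that legitimately ends in an abbreviation such as "Inc.", so check the result before relying on it.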
