
I couldn't find an answer online; forgive me if this is a duplicate question.

I have a column containing thousands of links to .txt webpages that I would like to parse/read. They contain both text and HTML code. Here's one example: link

jayjunior
  • You might want to look at [rvest](https://github.com/tidyverse/rvest) and [purrr](https://github.com/tidyverse/purrr) – David Jan 24 '20 at 15:48
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jan 24 '20 at 15:48
  • It really depends on what you want to do with it. You could look into the XML package, since it looks like it's XML. That will make it into an object you can work with, but obviously you're going to have to do way more work to make this into a data frame that you can run analysis on. – Caroline Jan 24 '20 at 15:51
  • Look at the edgarWebR package. – G. Grothendieck Jan 24 '20 at 15:51
  • @Phil, that looks more like an html page with a header than XML. – r2evans Jan 24 '20 at 15:55
  • @MrFlick I would like to save "Address of principal executive offices" in the company's 10K form. Here's the [html page](https://www.sec.gov/Archives/edgar/data/1096759/000126246313000226/ntm10ka.htm). Unfortunately older 10K submissions don't come in html format and only come in .txt. – jayjunior Jan 24 '20 at 15:59
  • @G.Grothendieck yes, I created the data using edgarWebR. basically for each company, I have created this dataset containing all their 10K submissions. Now I want to extract their HQ address (my end goal). unfortunately, old 10K submissions only come in .txt webpage like above. – jayjunior Jan 24 '20 at 16:02
  • @Phil exactly. Newer 10K submissions are easy to parse because they come in html format. The old ones, however, are .txt pages like above. Forgive me as I am very new to this, but is there a way to turn this .txt into an html and then parse? I am looking for "Address of principal executive offices" – jayjunior Jan 24 '20 at 16:03
  • Suggest you try parse_text_filing with various arguments. If that still does not work the help page indicates that the author is open to being contacted. Also I noticed that the package was removed from CRAN a few days ago and you might inquire about that. https://cran.r-project.org/web/packages/edgarWebR/index.html – G. Grothendieck Jan 24 '20 at 16:10
  • @r2evans since html is a type of XML, that's fine. The XML package has some useful functions for dealing with html specifically too, like `getHTMLLinks` and `readHTMLTable` that I use all the time for web scraping. – Caroline Jan 24 '20 at 16:15
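The rvest route suggested in the comments can be illustrated on an inline snippet (a made-up HTML fragment, not a real filing, so this is a sketch rather than a solution to the .txt problem):

```r
# Hedged sketch of the rvest/xml2 approach from the comments, run on a
# hypothetical HTML fragment rather than a live SEC page.
library(rvest)

snippet <- "<html><body><p>123 Main St</p><p>(Address of principal executive offices)</p></body></html>"
doc <- read_html(snippet)
html_text(doc)   # collapses the parsed document down to its plain text
```

The same two calls work on a URL, which is what makes rvest convenient once you have actual HTML to feed it.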

2 Answers


The pages contain HTML documents embedded in a text file, so it's easy to extract them by looking for the opening and closing HTML tags. Once you've done that, you can store the documents in a list and process all the HTML with a single lapply() call:

library(rvest)   # provides read_html(), html_text() and the %>% pipe

url <- paste0("https://www.sec.gov/Archives/edgar/data/1096759/",
              "000126246313000226/0001262463-13-000226.txt")

page   <- readLines(url)
start  <- grep("<HTML>", page)    # first line of each embedded document
finish <- grep("</HTML>", page)   # last line of each embedded document

htmls  <- mapply(function(x, y) paste0(page[x:y], collapse = "\n"), start, finish)
result <- lapply(htmls, function(x) read_html(x) %>% html_text())

This gives:

cat(result[[1]])
#>     29
#>      
#>     
#> 
#> Cash and Cash Equivalents
#> 
#>  
#> 
#> Cash and cash equivalents include highly liquid investments
#> with original maturities of three months or less.
#> 
#>  
#> 
#> Foreign Currency Translation
#> 
#>  
#> 
#> The Company’s functional and
#> reporting currency is U.S. dollars. The consolidated financial statements of the Company are translated to U.S. dollars in accordance
#> with ASC 830, “Foreign Currency Matters.” Monetary assets and liabilities denominated in foreign currencies
#> are translated using the exchange rate prevailing at the balance sheet date. Gains and losses arising on translation or settlement
#> of foreign currency denominated transactions or balances are included in the determination of income. The Company has not, to the
#> date of these consolidated financial statements, entered into derivative instruments to offset the impact of foreign currency fluctuations.
### etc...
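From there, one way to get to the original goal (the "Address of principal executive offices" line) is to split the extracted text into lines and take the line just above the label. Sketched here on an inline stand-in for `result[[1]]`, with a made-up address:

```r
# Hypothetical follow-up: in these filings the label sits on the line
# below the address, so grab the preceding line. `txt` stands in for
# the text extracted from one filing.
txt   <- "Acme Corp\n123 Example Road\n(Address of principal executive offices)"
lines <- strsplit(txt, "\n")[[1]]
hit   <- grep("Address of principal executive offices", lines)
address <- trimws(lines[hit - 1])
address   # "123 Example Road"
```

This assumes the address is a single line directly above the label, which matches the layout in the linked filing but may not hold for every submission.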
Allan Cameron

It really depends on how consistently these files are laid out, but if they always have this table at the top, you could do this:

library(XML)

x <- readLines("https://www.sec.gov/Archives/edgar/data/1096759/000126246313000226/0001262463-13-000226.txt")
tables <- readHTMLTable(paste(x, collapse = "\n"), stringsAsFactors = FALSE)

# the label appears on the row below the address, so step back one row
address <- tables[[1]][grep("Address of principal executive offices", tables[[1]][[1]]) - 1, 1]

That assumes the address is always in the first table on the page and always sits on the single row immediately above the "Address of principal executive offices" label. It might need some adjustment.
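Since the question mentions a whole column of thousands of links, the same lookup could be wrapped in a function and applied over that column. This is a sketch only: `urls` is a hypothetical character vector of filing links, and it carries the same first-table/one-row-above assumptions as the answer itself.

```r
# Sketch: apply the first-table lookup to many filing URLs.
# Assumes each page's first table carries the label one row below the address.
library(XML)

get_address <- function(u) {
  doc <- paste(readLines(u), collapse = "\n")
  tab <- readHTMLTable(doc, stringsAsFactors = FALSE)[[1]]
  hit <- grep("Address of principal executive offices", tab[[1]])
  if (length(hit) == 0) return(NA_character_)   # label not found on this page
  tab[hit - 1, 1]
}

# addresses <- vapply(urls, get_address, character(1))
```

Returning `NA_character_` for pages without the label keeps `vapply` from failing partway through a long column.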

Caroline