I couldn't find an answer online; forgive me if this is a duplicate question.
I have a column containing thousands of links to .txt webpages, and I would like to parse/read them. They contain both text and HTML code. Here's one example: link
The pages contain HTML documents embedded in a text file. It's easy to extract them by looking for the opening and closing <HTML> tags. Once you've done that, you can store them in a list and process all of the HTML with an lapply call:
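library(rvest)  # provides read_html(), html_text() and the %>% pipe
url <- paste0("https://www.sec.gov/Archives/edgar/data/1096759/",
              "000126246313000226/0001262463-13-000226.txt")
page <- readLines(url)
# Line numbers where each embedded HTML document starts and ends
start <- grep("<HTML>", page)
finish <- grep("</HTML>", page)
# Collapse each start:finish range of lines into a single HTML string
htmls <- mapply(function(x, y) paste0(page[x:y], collapse = "\n"), start, finish)
# Parse each document and keep only its text
result <- lapply(htmls, function(x) read_html(x) %>% html_text())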
This gives:
cat(result[[1]])
#> 29
#>
#>
#>
#> Cash and Cash Equivalents
#>
#>
#>
#> Cash and cash equivalents include highly liquid investments
#> with original maturities of three months or less.
#>
#>
#>
#> Foreign Currency Translation
#>
#>
#>
#> The Company’s functional and
#> reporting currency is U.S. dollars. The consolidated financial statements of the Company are translated to U.S. dollars in accordance
#> with ASC 830, “Foreign Currency Matters.” Monetary assets and liabilities denominated in foreign currencies
#> are translated using the exchange rate prevailing at the balance sheet date. Gains and losses arising on translation or settlement
#> of foreign currency denominated transactions or balances are included in the determination of income. The Company has not, to the
#> date of these consolidated financial statements, entered into derivative instruments to offset the impact of foreign currency fluctuations.
### etc...
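Since the question is about thousands of links, it may help to wrap the extraction above in a function. This is only a minimal sketch: `extract_html_text` is a hypothetical helper name, the `demo` vector is made-up input standing in for one downloaded file, and the case-insensitive matching and the tag-count check are my own assumptions about how messy the files might be.

```r
library(rvest)

# Hypothetical helper: return the text of every <HTML>...</HTML> block
# found in a character vector of lines (e.g. the result of readLines()).
extract_html_text <- function(lines) {
  start  <- grep("<HTML>",  lines, ignore.case = TRUE)
  finish <- grep("</HTML>", lines, ignore.case = TRUE)
  # Assumption: if the tags are missing or unbalanced, return nothing
  # rather than pairing the wrong start and finish lines.
  if (length(start) == 0 || length(start) != length(finish)) {
    return(character(0))
  }
  vapply(seq_along(start), function(i) {
    block <- paste(lines[start[i]:finish[i]], collapse = "\n")
    html_text(read_html(block))
  }, character(1))
}

# Made-up input that mimics two HTML documents embedded in one text file.
demo <- c(
  "<DOCUMENT>",
  "<HTML>",
  "<BODY><P>Cash and Cash Equivalents</P></BODY>",
  "</HTML>",
  "</DOCUMENT>",
  "<DOCUMENT>",
  "<HTML>",
  "<BODY><P>Foreign Currency Translation</P></BODY>",
  "</HTML>",
  "</DOCUMENT>"
)

texts <- extract_html_text(demo)
length(texts)
```

To run it over a whole column, something like `lapply(links, function(u) extract_html_text(readLines(u)))` should work, where `links` is a hypothetical character vector of URLs. Wrapping the `readLines()` call in `tryCatch()` is probably wise so that one failed download doesn't stop the whole batch.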
It really depends on how consistently these files are laid out, but if they always have this table at the top, you could do this:
library(XML)
x <- readLines("https://www.sec.gov/Archives/edgar/data/1096759/000126246313000226/0001262463-13-000226.txt")
i <- readHTMLTable(x, stringsAsFactors = FALSE)
address <- i[[1]][grep("Address of principal executive offices", i[[1]][[1]]) - 1, 1]
That assumes the address is always in the first table on the page and that it is a single line sitting directly above the "Address of principal executive offices" label. It might need some adjustment.