I have 200,000 links that I am trying to download. I tried downloading them all in one go, but I ran into memory issues.

I am trying to create a function which will download 1000 links at a time and save them in a folder.

Packages:

library(dplyr)
library(purrr)
library(edgarWebR)

A small sample of the data is as follows:

Data 1:

urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm"
)

I then apply the following function to download these 10 links:

parsed_files <- map(urls_to_parse, possibly(parse_filing, otherwise = NA))

This stores the results as a list. I can then apply names(parsed_files) <- urls_to_parse to name the list elements after the links they were downloaded from. I can also use output <- plyr::ldply(parsed_files, data.frame) to store everything in a data frame.

Using the data below, how could I download the links in batches of, say, 10?

What I have currently:

start = 1
end = 100

output <- NULL
output_fin <- NULL

for(i in start:end){
  output[[i]] <- map(urls_to_parse[[i]], possibly(parse_filing, otherwise = NA))
  names(output) <- urls_to_parse[start:end]
  save(output_fin, file = paste0("C:/Users/Downloads/data/",i, "output.RData"))
}

I am sure there is a better way using a function, since this code breaks for some of the results.

More data (100 links):

urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746908008126/a2186742z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000110465907055173/a07-18543_110k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000110465906047248/a06-15961_110k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000110465905033688/a05-12324_110k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746904023905/a2140220z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746903028005/a2116671z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000091205702033450/a2087919z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000095012310108231/c61492e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000095015208010514/n48172e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000095013707018659/c22309e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000095013707000193/c11187e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000095013406000594/c01109e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000120677405000032/d16006.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000120677404000013/d13773.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000104746903001075/a2097401z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000091205702001614/a2067550z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/319126/000115752308008030/a5800571.htm", 
"https://www.sec.gov/Archives/edgar/data/319126/000115752307009801/a5515869.htm", 
"https://www.sec.gov/Archives/edgar/data/319126/000115752306009238/a5227919.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046908000102/alpharmainc_10k.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046907000017/alo10k2006.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046906000027/alo10k2005.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046905000021/alo10k2004final.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046904000058/alo10k2003master.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046903000001/alo10k.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046902000004/alo10k2001.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046901500003/alo.htm", 
"https://www.sec.gov/Archives/edgar/data/4515/000000620118000009/a10k123117.htm", 
"https://www.sec.gov/Archives/edgar/data/4515/000119312517051216/d286458d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/4515/000119312516474605/d78287d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/4515/000119312515061145/d829913d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000000620113000023/amr-10kx20121231.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000119312512063516/d259681d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000095012311014726/d78201e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000000620110000006/ar123109.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000000620109000009/ar120810k.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000000451508000014/ar022010k.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000095013407003888/d43815e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000095013406003715/d33303e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000095013405003726/d22731e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000095013404002668/d12953e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000104746903013301/a2108197z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/65695/000095013407003823/h42902e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/65695/000095012906002343/h31028e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/65695/000095012905002955/h22337e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000156459018005085/cece-10k_20171231.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000156459017004264/cece-10k_20161231.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000156459016015157/cece-10k_20151231.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312515095828/d864880d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312514098407/d661608d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312513109153/d444138d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312512119293/d293768d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312511067373/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312510069639/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312509055504/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312508058939/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312507071909/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312506068031/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312505077739/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312504052176/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/2601/000110465910047121/a10-16705_110k.htm", 
"https://www.sec.gov/Archives/edgar/data/2601/000114420409046933/v159572_10k.htm", 
"https://www.sec.gov/Archives/edgar/data/2601/000110465906060737/a06-19311_110k.htm", 
"https://www.sec.gov/Archives/edgar/data/2601/000104746905022854/a2162888z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/2601/000104746904028585/a2143353z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/2601/000104746903031974/a2119476z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000143774918010388/avx20180331_10k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916317000028/avx-20170331x10k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916316000079/avx-20160331x10k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916315000024/avx-20150331x10k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916314000035/avx-20140331x10k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916313000022/avx-20130331x10k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916312000024/avxform10kfy12.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916311000013/avxform10kfy11.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916310000020/avxform10kfy10.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916309000117/form10kfy09.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000192/form10qq1fy09.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000101/form10kfy08.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916307000122/form10kfy07.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916306000102/avxfy06form10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916305000094/fy0510k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916304000091/fy0410k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916303000020/fy0310k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916302000007/r10k-0302.htm", 
"https://www.sec.gov/Archives/edgar/data/7286/000076462218000018/pnw2017123110-k.htm", 
"https://www.sec.gov/Archives/edgar/data/7286/000076462217000010/pnw2016123110-k.htm", 
"https://www.sec.gov/Archives/edgar/data/7286/000076462216000087/pnw2015123110-k.htm", 
"https://www.sec.gov/Archives/edgar/data/7286/000076462215000013/pnw12311410-k.htm", 
"https://www.sec.gov/Archives/edgar/data/7286/000110465914012068/a13-25897_110k.htm"
)
user113156

1 Answer

Looping over the URLs to run a batch job the way you showed is a bad idea. If you have thousands of files to download, how do you recover from errors?

Performance does not depend solely on your computer's configuration; network performance is crucial too.

Here are a couple of suggestions.

Option 1

Why use a queue? Because it makes retrying on error easy: a batch that fails is simply pushed back onto the queue.

Pseudocode:


file_url_partitions <- partition_as_batches(all_urls, batch_size)
max_attempts = 3
attempt = 1
while( file_url_partitions is not empty && attempt <= max_attempts ) {
  batch = file_url_partitions.pop()

  tryCatch({
    download_parallel(batch)
  }, some_exception = function(se) {
    # put the failed batch back on the queue and count the retry
    file_url_partitions.push(batch)
    attempt = attempt + 1
  })
}

Note: I don't have access to RStudio or an R environment right now, so I have not been able to test this.
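The queue idea above could be sketched in base R roughly as follows. This is only an illustration under some assumptions: `make_batches()` and `process_queue()` are hypothetical helper names, `fetch_fn` is a placeholder for whatever does the per-URL work (for example `parse_filing`), and each batch is written to its own `.rds` file so that results never accumulate in memory. Unlike the pseudocode, attempts are counted per batch rather than globally.

```r
# Split a vector of URLs into consecutive batches of at most `batch_size`.
make_batches <- function(urls, batch_size) {
  unname(split(urls, ceiling(seq_along(urls) / batch_size)))
}

# Work through the batches as a queue. A batch that raises an error is
# pushed to the back of the queue until it has used `max_attempts` tries.
# `fetch_fn` is a placeholder for the real per-URL function.
process_queue <- function(urls, batch_size, fetch_fn, out_dir,
                          max_attempts = 3) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  batches  <- make_batches(urls, batch_size)
  queue    <- seq_along(batches)            # ids of batches still to do
  attempts <- integer(length(batches))      # tries used so far, per batch
  failed   <- integer(0)                    # ids that ran out of tries

  while (length(queue) > 0) {
    id <- queue[1]                          # pop the front of the queue
    queue <- queue[-1]
    attempts[id] <- attempts[id] + 1L

    ok <- tryCatch({
      result <- lapply(batches[[id]], fetch_fn)
      names(result) <- batches[[id]]
      # Save this batch on its own, so nothing piles up in memory.
      saveRDS(result, file.path(out_dir, sprintf("batch_%05d.rds", id)))
      TRUE
    }, error = function(e) FALSE)

    if (!ok) {
      if (attempts[id] < max_attempts) {
        queue <- c(queue, id)               # push back for a retry
      } else {
        failed <- c(failed, id)
      }
    }
  }
  invisible(list(failed = failed, attempts = attempts))
}
```

A call might look like `process_queue(urls_to_parse, 10, edgarWebR::parse_filing, "data")`. Note that if `fetch_fn` is wrapped in `purrr::possibly()`, individual failures become `NA` and never raise an error, so the batch-level retry would not be triggered; which behaviour is preferable depends on how the failures should be handled.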

Option 2

Download the files separately, using a download manager or similar, and then work from the downloaded files.

Some useful resources:

https://www.r-bloggers.com/r-with-parallel-computing-from-user-perspectives/
http://adv-r.had.co.nz/beyond-exception-handling.html
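If a separate download manager is not at hand, the same "download first, parse later" idea can be approximated from R itself. The following is a minimal sketch, not a tested recipe for EDGAR: `local_path()` and `download_missing()` are hypothetical helpers, the file-naming scheme (last two URL path components joined by `_`) is an assumption, and the `pause` between requests is an arbitrary value. Files already on disk are skipped, so the function can be rerun after a failure and will pick up where it left off.

```r
# Build a local file name from the last two path components of the URL
# (assumed naming scheme; for the EDGAR links this is accession number
# plus document name, which keeps names unique).
local_path <- function(url, dest_dir) {
  parts <- strsplit(url, "/", fixed = TRUE)[[1]]
  file.path(dest_dir, paste(utils::tail(parts, 2), collapse = "_"))
}

# Download every URL that is not on disk yet and report a status per URL.
# `downloader` defaults to base R's download.file(); it is an argument so
# that another function can be swapped in.
download_missing <- function(urls, dest_dir,
                             downloader = utils::download.file,
                             pause = 0.2) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  vapply(urls, function(u) {
    dest <- local_path(u, dest_dir)
    if (file.exists(dest)) return("skipped")
    ok <- tryCatch({
      downloader(u, dest, quiet = TRUE)
      TRUE
    }, error = function(e) FALSE, warning = function(w) FALSE)
    if (!ok) unlink(dest)   # drop any partial file so a rerun retries it
    Sys.sleep(pause)        # be gentle with the server
    if (ok) "downloaded" else "failed"
  }, character(1))
}
```

Running `table(download_missing(urls_to_parse, "filings"))` would then summarise how many files were downloaded, skipped, or failed. The saved files can be parsed afterwards in small groups; whether `parse_filing` can read a document loaded from a local file (for example via `xml2::read_html()`) should be checked against the edgarWebR documentation.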

Laksitha Ranasingha