First post here. Countless times before, I found that my problem had already been solved by someone else, but I can't figure this one out.
The following while loop is intended to download the text contained in a list of URLs (3 in the example). It works for all of the links except the third one in the example, which takes very long to download. Opening it in the browser shows that it contains a number of jpg images. (Each txt file represents a number of documents merged into one file; in this specific example, some of the documents were images.)
The images can be recognized in the text thanks to these lines before them:
<DOCUMENT>
<TYPE>GRAPHIC
<SEQUENCE>7
<FILENAME>tex99-4_pg01.jpg
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 tex99-4_pg01.jpg
and each image block ends with these lines:
end
</TEXT>
</DOCUMENT>
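For context, here is a sketch of how those markers could be matched to cut the image blocks out after download (the regex and the function name are my own guesses, not tested against real filings) — this is the post-processing route I would rather avoid:

```python
import re

# Matches a whole <DOCUMENT>...</DOCUMENT> block whose first tag after
# <DOCUMENT> is <TYPE>GRAPHIC, i.e. an embedded (uuencoded) image.
GRAPHIC_BLOCK = re.compile(
    r"<DOCUMENT>\s*<TYPE>GRAPHIC.*?</DOCUMENT>",
    re.DOTALL,  # let .*? span the newlines inside the block
)

def strip_graphics(filing_text):
    """Remove every GRAPHIC <DOCUMENT>...</DOCUMENT> block from a filing."""
    return GRAPHIC_BLOCK.sub("", filing_text)
```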
Is there any way to SKIP a link if its download takes too long, so the scraper runs faster? I want to apply this code to 320K of these links, so I would rather speed up the download itself than "cut" the txt I get afterwards.
This is what I am currently using to scrape:
import pandas as pd
import requests
list_of_links = ['https://www.sec.gov/Archives/edgar/data/1000298/0001193125-14-321757.txt',
'https://www.sec.gov/Archives/edgar/data/1002225/0000903423-14-000495.txt',
'https://www.sec.gov/Archives/edgar/data/1004724/0001144204-14-042745.txt'] # the one with the images
number_of_urls = len(list_of_links) # get number of links to iterate them
i = 0
column_names = ["Filing text"]
DF = pd.DataFrame(columns = column_names)
while i < number_of_urls:
    print("File #", i+1, "\tis being processed")  # print this to visually see how long each download takes
    DF.loc[i, "Filing text"] = requests.get(list_of_links[i]).text
    i += 1
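A minimal sketch of the skip-if-too-slow idea, using requests' `timeout` parameter (which caps each connect/read, not the total download time) together with streaming and a size cap to abort oversized filings; `fetch_text` and both limit values are names and numbers I made up, not anything from the code above:

```python
import requests

def fetch_text(url, timeout=10, max_bytes=5_000_000):
    """Fetch url as text, or return None to signal 'skip this link'.

    `timeout` limits how long each connect/read may block (requests
    does not bound the total download time); `max_bytes` aborts
    downloads that grow too large, e.g. filings full of embedded
    images. Both limits are assumptions, tune them as needed.
    """
    try:
        with requests.get(url, timeout=timeout, stream=True) as response:
            response.raise_for_status()
            chunks, size = [], 0
            for chunk in response.iter_content(chunk_size=64 * 1024):
                size += len(chunk)
                if size > max_bytes:
                    return None  # too large: skip instead of waiting
                chunks.append(chunk)
            return b"".join(chunks).decode(
                response.encoding or "utf-8", errors="replace"
            )
    except requests.exceptions.RequestException:
        return None  # timeout / connection error: skip this link
```

In the loop, a `None` result would then be skipped (or stored as a placeholder) instead of blocking the whole run on one slow filing.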