
First post here; countless other times I encountered problems that were already solved, but I can't figure this one out.

The following while loop is intended to download the text contained in a list of URLs (3 in the example). It works for all the links, but one of them (the 3rd in the example) takes very long. Opening it in the browser shows that it contains a number of jpg images. (The txt represents a number of documents merged into 1 file; in this specific example, some of the documents were images.)

The images can be recognized in the text thanks to these lines before them:

<DOCUMENT>
<TYPE>GRAPHIC
<SEQUENCE>7
<FILENAME>tex99-4_pg01.jpg
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 tex99-4_pg01.jpg

and they are followed by this code:

end
</TEXT>
</DOCUMENT>

Is there any way to SKIP that link if the download takes too long, so as to make the scraper faster? I am looking to apply this code to 320K of these links, so I would like to make the download itself faster rather than "cutting" the txt I get afterwards (a sketch of that cutting approach is below, for reference).
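For reference, the cutting I would like to avoid could look something like this sketch, which drops the GRAPHIC blocks after the download using the markers shown above (the regex is my own guess based on those markers):

import re

# drop every <DOCUMENT>...</DOCUMENT> block whose <TYPE> is GRAPHIC,
# matching the markers shown above
GRAPHIC_BLOCK = re.compile(r"<DOCUMENT>\s*<TYPE>GRAPHIC.*?</DOCUMENT>", flags=re.DOTALL)

def strip_graphics(filing_text):
    return GRAPHIC_BLOCK.sub("", filing_text)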

This is what I am currently using to scrape:

import pandas as pd
import requests

list_of_links = ['https://www.sec.gov/Archives/edgar/data/1000298/0001193125-14-321757.txt',  
                 'https://www.sec.gov/Archives/edgar/data/1002225/0000903423-14-000495.txt', 
                 'https://www.sec.gov/Archives/edgar/data/1004724/0001144204-14-042745.txt'] # the one with the images
number_of_urls = len(list_of_links)  # get number of links to iterate them
i = 0         

column_names = ["Filing text"]
DF = pd.DataFrame(columns = column_names)

while i < number_of_urls: 
    print("File #", i+1,"\tis being processed") # print this to visually see how long each download takes
    DF.loc[i, "Filing text"] = requests.get(list_of_links[i]).text
    i += 1
  • oh man -- don't append to your dataframe like that. Populate it when you instantiate: `pandas.DataFrame([requests.get(link).text for link in list_of_links], columns=column_names)` – Paul H May 14 '20 at 19:41

1 Answer


For the requests library, you can check this answer: https://stackoverflow.com/a/22347526/8294752

In your case, since it's always just text data, looking at the Content-Length header should be enough.
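The idea in a minimal sketch (note that the server is not guaranteed to send a Content-Length header, so I default to 0 here):

import requests

url = "https://www.sec.gov/Archives/edgar/data/1004724/0001144204-14-042745.txt"
# with stream=True only the headers are fetched up front; the body is not
# downloaded until you access r.content or iterate r.iter_content()
r = requests.get(url, stream=True)
size = int(r.headers.get("Content-Length", 0)) # 0 if the header is missing
print(size)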

On a different note, it's always good practice to include a timeout in a requests call. This will not solve your problem by itself, since the timeout only covers the time during which the server does not answer, not the total download time, but omitting it can create problems, especially in a loop.
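For example (the numbers here are arbitrary; requests accepts a single value or a (connect, read) tuple):

import requests

url = "https://www.sec.gov/Archives/edgar/data/1000298/0001193125-14-321757.txt"
# raises requests.exceptions.Timeout if the connection takes longer than
# 3.05 s to establish or the server is silent for more than 27 s between bytes
r = requests.get(url, timeout=(3.05, 27))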

EDIT

This could be a working solution to your problem (I've incorporated Paul H's feedback here as well):

Please note that I swapped the second and the last URL in the list, so you can better assess the time savings. Also, remember that you will have some None values in the df at the end, and that you should set a meaningful CONTENT_LENGTH_LIMIT depending on the data that you want to download.

import pandas as pd
import requests

list_of_links = ['https://www.sec.gov/Archives/edgar/data/1000298/0001193125-14-321757.txt',
                 'https://www.sec.gov/Archives/edgar/data/1004724/0001144204-14-042745.txt', # the one with the images
                 'https://www.sec.gov/Archives/edgar/data/1002225/0000903423-14-000495.txt']

column_names = ["Filing text"]

CONTENT_LENGTH_LIMIT = 8070869 # a limit which is lower than the size of the file with the images
def fetch_text(url):
    print("Url", url,"\tis being processed") # print this to visually see how long each download takes
    try:
        r = requests.get(url, stream=True, timeout=5)
        # Content-Length may be missing; default to 0 so the file is still downloaded
        if int(r.headers.get('Content-Length', 0)) > CONTENT_LENGTH_LIMIT:
            print("Url", url, "\thas been skipped")
            return None
        else: 
            text = "".encode() # the chunks are in bytes
            for chunk in r.iter_content(1024):
                text += chunk
        return text.decode()
    except requests.exceptions.Timeout:
        print("The server did not respond in time")
        # falls through and returns None here; you can filter it later

DF = pd.DataFrame([fetch_text(link) for link in list_of_links],
                    columns=column_names)
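If you want to drop the rows for the skipped or timed-out links afterwards, something like this should work:

DF = DF.dropna()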