
Basically, I am working on a Python project where I download and index files from the SEC EDGAR database. The problem, however, is that when using the requests module, it takes a very long time to save the text in a variable (roughly 130 to 170 seconds for one file).

The file has roughly 16 million characters, and I wanted to see if there was an easy way to lower the time it takes to retrieve the text. Example:

import requests

url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"

r = requests.get(url, stream=True)

print(r.text)

Thanks!

Kuba hasn't forgotten Monica
Jake Schurch

2 Answers


What I found is that the slowdown is in the code for r.text, specifically when no encoding was given (r.encoding is None). Detecting the encoding took 20 seconds; I was able to skip that step by defining the encoding myself.

...
r.encoding = 'utf-8' 
...

Additional details

In my case, the response did not declare an encoding. It was 256k in size, and r.apparent_encoding took 20 seconds.

Looking at the text property: it first checks whether an encoding is set. If it is None, it falls back to apparent_encoding, which scans the content to auto-detect the encoding scheme.

On a long response this takes time. By defining the encoding of the response (as described above), you skip the detection.
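As a minimal sketch of that idea, the helper below sets the encoding only when the response did not report one, then reads the text. The function name `text_with_known_encoding` is hypothetical, and the `"utf-8"` default is an assumption about the document, not something the server guarantees.

```python
def text_with_known_encoding(response, encoding="utf-8"):
    """Return response.text without triggering encoding auto-detection.

    `response` is expected to be a requests.Response (anything exposing
    `encoding` and `text` works). `encoding` is an assumed charset; pass
    the real one if you know it.
    """
    if response.encoding is None:
        # No charset was reported, so reading .text would fall back to
        # apparent_encoding and scan the whole body. Set it ourselves.
        response.encoding = encoding
    return response.text
```

With the question's code, this would be called as `text_with_known_encoding(r)` right after `requests.get(url)`. If the assumed charset is wrong, the text may contain replacement characters, so it is worth checking a sample of the output.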

Validate that this is your issue

In your example above:

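from datetime import datetime
import requests

url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"

r = requests.get(url, stream=True)

# Encoding taken from the response headers; None means requests has to guess.
print(r.encoding)

# Time the encoding detection on its own.
# (With stream=True, the body is also downloaded on this first access.)
print(datetime.now())
enc = r.apparent_encoding
print(enc)

# Time r.text with the encoding left as the server reported it.
print(datetime.now())
print(r.text)
print(datetime.now())

# Set the encoding explicitly and time r.text again; detection is skipped.
r.encoding = enc
print(r.text)
print(datetime.now())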

Of course, the timings may get lost in the printed output, so I recommend running the above in an interactive shell. It may become apparent where you are losing the time even without printing datetime.now().
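If the timestamps are hard to pick out, one option is to time each step separately and print only the durations. This is a small sketch using the standard library's `time.perf_counter`; the helper name `timed` is hypothetical.

```python
import time


def timed(label, func):
    """Call func(), print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed
```

For the response `r` from the example, `timed("apparent_encoding", lambda: r.apparent_encoding)` and `timed("text", lambda: r.text)` would show the two costs without dumping the document to the console.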

AlainChiasson
  • That's a brilliant answer, thanks! We ran into a similar issue that was occurring randomly. I was able to track it down to response.text and replaced it with response.content; because that returns raw bytes, performance was normal at that point. However, I couldn't find much explanation online. This answer really brings clarity! – Simon Ninon Aug 13 '20 at 19:34

From @martijn-pieters

Decoding and printing 15MB of data to your console is often slower than loading data from a network connection. Don't print all that data. Just write it straight to a file.
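A rough sketch of writing the body straight to a file, assuming `response` is a requests.Response obtained with `stream=True`. The helper name `save_response` and the 64 KiB chunk size are illustrative choices, not something from the answer.

```python
def save_response(response, path, chunk_size=64 * 1024):
    """Write the response body to `path` in chunks; return bytes written."""
    written = 0
    with open(path, "wb") as fh:
        # iter_content yields raw bytes, so nothing is decoded or printed.
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                fh.write(chunk)
                written += len(chunk)
    return written
```

With the question's code, this would be `save_response(r, "goog10-kq42016.htm")`, where the output filename is only an example. Because the bytes are never turned into a string, the encoding detection discussed in the other answer does not run.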

Jake Schurch