
My question may have been asked before, but I haven't found any help for the scenario I'm working on.

I have tried different methods but still have no luck; any help would be appreciated.

Question

I'm trying to load a text file from the URL https://www.sec.gov/Archives/edgar/cik-lookup-data.txt so that I can modify the data and create a dataframe.

Example data from the link:

1188 BROADWAY LLC:0001372374:

119 BOISE, LLC:0001633290:

11900 EAST ARTESIA BOULEVARD, LLC:0001639215:

11900 HARLAN ROAD LLC:0001398414:

11:11 CAPITAL CORP.:0001463262:

I should get the output below:

   Name                              | number 
   1188 BROADWAY LLC                 | 0001372374 
   119 BOISE, LLC                    | 0001633290 
   11900 EAST ARTESIA BOULEVARD, LLC | 0001639215 
   11900 HARLAN ROAD LLC             | 0001398414 
   11:11 CAPITAL CORP.               | 0001463262

I'm stuck at the first step, loading the text file; I keep getting `HTTPError: HTTP Error 403: Forbidden`.

References used:

  1. Given a URL to a text file, what is the simplest way to read the contents of the text file?
  2. Python requests. 403 Forbidden

My code:

import urllib.request  # the lib that handles the URL stuff

data = urllib.request.urlopen("https://www.sec.gov/Archives/edgar/cik-lookup-data.txt")  # a file-like object that works just like a file
for line in data:  # files are iterable
    print(line)
Deepak
  • Interesting, I wonder if it's something to do with the user agent? Edit: nvm, that's in your link, maybe it's something similar? – Peter Feb 14 '22 at 15:32

2 Answers


The returned error message says:

Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic. Please declare your traffic by updating your user agent to include company specific information.

You can resolve this as follows:

import urllib.request

url = "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"
hdr = {'User-Agent': 'Your Company Name admin@domain.com'}  # change as needed

req = urllib.request.Request(url, headers=hdr)

data = urllib.request.urlopen(req, timeout=60).read().splitlines()

>>> data[:10]
[b'!J INC:0001438823:',
 b'#1 A LIFESAFER HOLDINGS, INC.:0001509607:',
 b'#1 ARIZONA DISCOUNT PROPERTIES LLC:0001457512:',
 b'#1 PAINTBALL CORP:0001433777:',
 b'$ LLC:0001427189:',
 b'$AVY, INC.:0001655250:',
 b'& S MEDIA GROUP LLC:0001447162:',
 b'&TV COMMUNICATIONS INC.:0001479357:',
 b'&VEST DOMESTIC FUND II KPIV, L.P.:0001802417:',
 b'&VEST DOMESTIC FUND II LP:0001800903:']
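
If helpful, here is one possible way to go from those raw lines to the DataFrame shown in the question. This is only a sketch, assuming pandas is installed and that latin-1 is an acceptable decoding; it relies on each line ending with a trailing colon, since names such as "11:11 CAPITAL CORP." also contain colons, so we only split on the last one.

import urllib.request

import pandas as pd  # assumed available

url = "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"
hdr = {'User-Agent': 'Your Company Name admin@domain.com'}  # change as needed

req = urllib.request.Request(url, headers=hdr)
text = urllib.request.urlopen(req, timeout=60).read().decode("latin-1")  # encoding assumed

rows = []
for line in text.splitlines():
    line = line.rstrip(":")             # drop the trailing colon
    if not line:
        continue                        # skip any blank lines
    name, number = line.rsplit(":", 1)  # split on the last colon only
    rows.append({"Name": name, "number": number})

df = pd.DataFrame(rows)
print(df.head())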
not_speshal
  • Nice! How do you see the error message being returned? – JNevill Feb 14 '22 at 16:02
  • `try`/`except` on the `urlopen` and catch and print all exceptions. – not_speshal Feb 14 '22 at 16:02
  • Nice! - How did you know to use the User-Agent that you did? Is there any sort of standard convention there? Did something about the 403 response prompt you to try that? I'm curious because I've done a fair amount of web scraping and haven't seen something as simple as this work so well. – CryptoFool Feb 14 '22 at 16:08
  • @CryptoFool: FAQs on the SEC website. See [here](https://www.sec.gov/os/webmaster-faq#user-agent) – not_speshal Feb 14 '22 at 16:10
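
For reference, a minimal sketch of the `try`/`except` approach mentioned in the comments: it reproduces the 403 by using urllib's default User-Agent and prints whatever error body the server returns (`HTTPError` is file-like, so it can be read).

import urllib.request
from urllib.error import HTTPError

url = "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"

try:
    urllib.request.urlopen(url, timeout=60)
except HTTPError as e:
    print(e.code)                               # 403
    print(e.read().decode(errors="replace"))    # the server's error message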

The request is disallowed, so you are getting a 403 response code. It is good practice to check the robots.txt file before scraping any web page. A robots.txt file tells search engine crawlers which URLs they may access on a site. It is used mainly to avoid overloading the site with requests; however, it is not a mechanism for keeping a web page out of Google.

In your case it is https://www.sec.gov/robots.txt
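
A minimal sketch of checking robots.txt programmatically with the standard library's urllib.robotparser (note that the robots.txt fetch itself uses urllib's default User-Agent, so it may also be rejected by the server; this only illustrates the API):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.sec.gov/robots.txt")
rp.read()  # fetch and parse robots.txt

path = "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"
print(rp.can_fetch("*", path))  # True if the generic user agent may fetch this URL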

Rajarshi Ghosh
  • Where does it state in the `robots.txt` that `/Archives` or `/Archives/edgar` or `/Archives/edgar/cik-lookup-data.txt` is Disallowed? I must be overlooking it. – JNevill Feb 14 '22 at 15:44
  • more importantly it says it allows ```Allow: /Archives/edgar/data``` - check it by trying to GET something from this endpoint – Rajarshi Ghosh Feb 14 '22 at 15:54
  • `/Archives/edgar/data` doesn't have anything to do with the path of the file being requested. I don't see anything in robots.txt that would cause the 403. I think something sneakier than `robots.txt` is happening. – JNevill Feb 14 '22 at 16:00
  • `robots.txt` in and of itself does nothing to prevent requests to specific URLs on a site. It's an advisory mechanism that the search engines choose to honor. It might give you insight into what parts of the site are being monitored for "bad" request patterns (bad User-Agent, for example), but any such mechanisms technically have nothing to do with `robots.txt`. And what good would it do to refer to `robots.txt` anyway? If you need a certain page from a site, you need that page. If you see the page you want "Disallowed" by `robots.txt`, how should this change your approach? – CryptoFool Feb 14 '22 at 16:19