
I'm trying to read some data from the web, but I've run into an unexpected problem. I call it unexpected because the URL I'm trying to read exists and opens without any problems. However, when I run the code below, I get the error "HTTP Error 404: Not Found", even though the URL exists (see here). Does anyone know what I am doing wrong? Thanks!

    import pandas as pd
    from bs4 import BeautifulSoup
    import urllib.request as ur
        
    index = 'MSFT'
    url_is = 'https://finance.yahoo.com/quote/' + index + '/financials?p=' + index
    # Read data
    read_data = ur.urlopen(url_is).read()

R__
    `import requests` then `read_data = requests.get(url_is, headers={'User-Agent': 'Mozilla/5.0'}).text` – QHarr Nov 05 '21 at 12:35

2 Answers


Using the requests module and sending a User-Agent header, the response status is 200:

    from bs4 import BeautifulSoup
    import requests

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}
    index = 'MSFT'
    url_is = 'https://finance.yahoo.com/quote/' + index + '/financials?p=' + index
    r = requests.get(url_is, headers=headers)
    print(r.status_code)
    # page = BeautifulSoup(r.content, 'lxml')

Output:

200
Md. Fazlul Hoque

Some sites require a valid "User-Agent" header. In your urllib example, the url argument of urlopen can also be a Request object, so you can specify the headers in the Request along with the URL, as below:

    from urllib.request import Request, urlopen

    index = 'MSFT'
    url_is = 'https://finance.yahoo.com/quote/' + index + '/financials?p=' + index
    req = Request(url_is, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
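
If it helps, here is a minimal sketch of the same approach with the error handling made explicit, so a 404 (or any other HTTP status) is returned rather than raised. The helper names `build_request` and `fetch` and the 10-second timeout are my own assumptions for illustration; they are not part of the original code.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_request(symbol, user_agent='Mozilla/5.0'):
    """Build a Request for the financials page with an explicit User-Agent."""
    url = 'https://finance.yahoo.com/quote/' + symbol + '/financials?p=' + symbol
    return Request(url, headers={'User-Agent': user_agent})


def fetch(req, timeout=10):
    """Return (status_code, body_bytes); HTTP errors are returned, not raised."""
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except HTTPError as err:
        # HTTPError carries the status code and can be read like a response.
        return err.code, err.read()
```

Calling `fetch(build_request('MSFT'))` should then give you the status code alongside the page bytes, which makes it easier to see whether the header made a difference. Connection problems (`URLError`) are deliberately not caught in this sketch.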
Archivec
  • Fantastic! It works. Is there any preference for the 'User-Agent'? I mean, why that one and not another; does it make any difference? – R__ Nov 05 '21 at 13:56
  • @R__ I am not entirely sure but I found this answer [here](https://stackoverflow.com/a/24274846/15704825) which may help. – Archivec Nov 08 '21 at 06:11
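
On the follow-up question about which User-Agent to pick: when you don't set one, urllib identifies itself as `Python-urllib/<version>`, and some sites appear to reject that while accepting a browser-like string. The short sketch below only inspects headers locally (no request is sent); `https://example.com` is just a placeholder URL. Which strings a particular site accepts is up to that site, so treat this as a way to see what you are sending, not as a rule.

```python
import urllib.request

# A fresh opener carries the headers urllib adds when none are supplied.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers['User-agent'])  # e.g. Python-urllib/3.x

# A header set on the Request takes precedence over that default.
req = urllib.request.Request('https://example.com',
                             headers={'User-Agent': 'Mozilla/5.0'})
print(req.get_header('User-agent'))
```

Note that `Request` stores header names capitalized as `User-agent`, which is why the lookups use that spelling.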