Getting error while web scraping the link

Question

I am getting an error while scraping the link below. Can anybody help me fix the error and show me code that scrapes all the text data from the page?

from urllib.request import Request, urlopen
link='https://novelfull.com/warriors-promise/chapter-1.html'
req = Request(link) 
webpage = urlopen(req).read()

It would help a lot if you posted the error... — John Gordon, Mar 14 '21 at 04:09

Jacob Lee · Accepted Answer · 2021-03-14T04:19:21.140

You could try using requests:

>>> import requests
>>> res = requests.get("https://novelfull.com/warriors-promise/chapter-1.html")
>>> res.raise_for_status()
>>> res.text
'\r\n<!DOCTYPE html><html lang="en-US"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>...'

To get the content of the page (the actual story, in this case), you would likely need an HTML parser, such as BeautifulSoup4 or lxml.

BeautifulSoup4

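import bs4
import requests

res = requests.get("https://novelfull.com/warriors-promise/chapter-1.html")
res.raise_for_status()  # raise an exception on 4xx/5xx responses
soup = bs4.BeautifulSoup(res.text, features="html.parser")
# select() returns a list of matching elements, so take the first one
elem = soup.select("#chapter-content div:nth-child(3) div")[0]
content = elem.get_text()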

BeautifulSoup4 and requests are third-party modules, so be sure to install them: pip install beautifulsoup4 requests.

lxml

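from urllib.request import urlopen
from lxml import etree

res = urlopen("https://novelfull.com/warriors-promise/chapter-1.html")
htmlparser = etree.HTMLParser()  # note the capital "P"
tree = etree.parse(res, htmlparser)
# xpath() returns a list of matching elements, so take the first one
elem = tree.xpath("//div[@id='chapter-content']//div[3]//div")[0]
# itertext() also collects text from nested tags, unlike .text
content = "".join(elem.itertext())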

lxml is a third-party module, so be sure to install it: pip install lxml.
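If installing a third-party parser is not an option, the standard library's html.parser can also pull text out of a page. The following is only a minimal sketch: the class name TextExtractor, the helper extract_text, and the sample HTML are made up for illustration, and it collects all visible text rather than targeting the chapter element the way the selectors above do.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up HTML standing in for a downloaded page.
sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<div id='chapter-content'><p>First line.</p><p>Second line.</p></div>"
    "</body></html>"
)
print(extract_text(sample))
```

For a real page you would pass the decoded response body to extract_text instead of the sample string.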

Thanks for the answer, but neither of them is working. Both BeautifulSoup and lxml are showing an error. — desktopp, Mar 14 '21 at 04:13
@desktopp Did you install them? They are both third-party modules, so you would have to run `pip install BeautifulSoup4` and `pip install lxml`, respectively. — Jacob Lee, Mar 14 '21 at 04:17
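One quick way to check whether the modules are importable in the interpreter you are actually running is sketched below. The helper name is_installed is made up for illustration; note that BeautifulSoup4 is imported under the name bs4.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Import names, not pip package names: BeautifulSoup4 is imported as "bs4".
for name in ("bs4", "lxml", "requests"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a module shows as missing even after running pip, pip may be installing into a different Python than the one running the script.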

score 1 · Answer 2 · answered Mar 14 '21 at 04:31

Setting the User-Agent header so that the request looks like it comes from a browser seems to avoid the HTTP 403: Forbidden error, e.g.:

from urllib.request import Request, urlopen
link='https://novelfull.com/warriors-promise/chapter-1.html'
req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
webpage = urlopen(req).read()

You can also see this question for a similar case
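Building on the snippet above, here is a hedged sketch that wraps the same idea in small functions and reports HTTP errors more readably. The names build_request, fetch_html, and DEFAULT_USER_AGENT are assumptions introduced for illustration, the user agent string is only an example value, and whether a given site accepts it is not guaranteed.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Illustrative value only; any browser-like string could be used here.
DEFAULT_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Create a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and return its body decoded to a string."""
    req = build_request(url)
    try:
        with urlopen(req, timeout=timeout) as resp:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except HTTPError as err:
        # HTTPError must be caught first because it subclasses URLError.
        raise RuntimeError(f"HTTP {err.code} for {url}: {err.reason}") from err
    except URLError as err:
        raise RuntimeError(f"Could not reach {url}: {err.reason}") from err
```

Usage would look like html = fetch_html(link), after which the returned string can be handed to whichever parser you prefer.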
