Getting incorrect HTML content of a page using urllib and requests

Question

I used two methods to get the page source of an internal application link.

First method: the Robot Framework keyword ${html_page} =    Get Source
Second method:
- using urllib.request: visit_url_content = urllib.request.urlopen(url).read().decode('utf-8'), and
- using requests: visit_url_content = requests.get(url, 'html.parser').text

After getting the page source, I extract all links (a tags with an href attribute) using BeautifulSoup: soup = BeautifulSoup(html_page, "html.parser")

With the first method I get about 20 links, but with the second method I get only 2. I need to process this in Python, so I cannot use the Robot Framework option. Any help as to why this might be happening?

Why -1 on this question? – Mahak Malik, Aug 10 '21 at 16:52

score 0 · Answer 1 · answered Aug 10 '21 at 21:42

It is a bit unclear what your code looks like exactly, since you only posted a few snippets. I assume it looks something like this:

import urllib.request
from bs4 import BeautifulSoup

URL = "your-url"

html = urllib.request.urlopen(URL).read().decode('utf-8')

soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all('a', href=True):
    print(a["href"])

Based on StackOverflow: BeautifulSoup getting href

Does this code differ from yours in some way? Can you share your complete code and the URL you want to crawl? Otherwise it is hard to find out what the problem is.
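In the meantime, one way to narrow this down is to list exactly which links show up in the Robot Framework source but not in the one fetched from Python. Below is a minimal sketch of that comparison, not a diagnosis of your page. It uses the standard library's HTMLParser instead of BeautifulSoup so it runs without extra packages. The names HrefCollector, extract_hrefs and missing_links, as well as the two sample HTML strings, are made up for illustration; you would pass in your own html_page and visit_url_content instead.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return all href values of <a> tags in document order."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


def missing_links(rendered_html, raw_html):
    """Return hrefs present in rendered_html but absent from raw_html."""
    raw = set(extract_hrefs(raw_html))
    return [h for h in extract_hrefs(rendered_html) if h not in raw]


# Hypothetical sample documents standing in for the two page sources.
rendered_sample = '<a href="/home">Home</a><a href="/reports">Reports</a><a>no href</a>'
raw_sample = '<a href="/home">Home</a>'

print(missing_links(rendered_sample, raw_sample))
```

Seeing which links are missing, and what the Python-fetched HTML contains in their place, usually makes it easier to tell where the two sources diverge. Also note that in requests.get(url, 'html.parser') the second positional argument is params (query-string data), not a parser name, so that call may not request quite the URL you intended.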
