How to HTML parse a URL list using python

Question

I have a list of 5 URLs in a .txt file named URLlist.txt.

https://www.w3schools.com/php/php_syntax.asp
https://www.w3schools.com/php/php_comments.asp
https://www.w3schools.com/php/php_variables.asp
https://www.w3schools.com/php/php_echo_print.asp
https://www.w3schools.com/php/php_datatypes.asp

I need to parse all the HTML content within the 5 URLs one by one for further processing.

My current code to parse an individual URL -

import requests
from bs4 import BeautifulSoup as bs   # HTML parsing using BeautifulSoup

r = requests.get("https://www.w3schools.com/whatis/whatis_jquery.asp")
soup = bs(r.content)   
print(soup.prettify())

[Read the file line by line](https://stackoverflow.com/questions/3277503/how-to-read-a-file-line-by-line-into-a-list) and process each line one-by-one. — Ari Cooper-Davis, Mar 22 '22 at 09:53

score 0 · Answer 1 · answered Mar 22 '22 at 09:55

Your problem can be solved by reading the file line by line and passing each line to your request. Sample:

import requests
from bs4 import BeautifulSoup as bs   # HTML parsing using BeautifulSoup

with open("URLlist.txt", "r") as f:
    for line in f:
        url = line.strip()   # drop the trailing newline
        if not url:
            continue         # skip blank lines
        print(url)           # current URL
        r = requests.get(url)
        soup = bs(r.content, "html.parser")
        print(soup.prettify())

score 0 · Answer 2 · answered Mar 22 '22 at 10:00

Create a list of your links:

with open('URLlist.txt', 'r') as f:
    urls = [line.strip() for line in f]

Then you can loop over the list and parse each URL:

for url in urls:
    r = requests.get(url)
    ...
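
As a minimal sketch of the list-building step above, the helper below also skips blank lines, which a trailing newline at the end of the file would otherwise turn into an empty URL. The name read_urls is hypothetical and not part of any library.

```python
def read_urls(path):
    """Return the non-empty, whitespace-stripped lines of a text file."""
    with open(path, encoding="utf-8") as f:
        # line.strip() removes the newline and any surrounding spaces;
        # the condition drops lines that are empty after stripping
        return [line.strip() for line in f if line.strip()]
```

Each URL returned by read_urls("URLlist.txt") can then be passed to requests.get(url), as in the loop above.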

answered Mar 22 '22 at 10:00

Sharkerz


score 0 · Accepted Answer · answered Mar 22 '22 at 10:26

The way you implement this depends on whether you need to process the URLs one at a time or whether it's better to gather all the content first for subsequent processing. I suggest the latter: build a dictionary where each key is a URL and the associated value is the text (HTML) returned from the page. Use multithreading for greater efficiency.

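import requests
from concurrent.futures import ThreadPoolExecutor

data = dict()   # maps each URL to the HTML text of its page

def readurl(url):
    try:
        # fetch the page; raise_for_status() raises on 4xx/5xx responses
        (r := requests.get(url)).raise_for_status()
        data[url] = r.text
    except Exception:
        pass   # URLs that fail are skipped and will be missing from data

def main():
    with open('urls.txt') as infile:
        # the pool is joined when the with-block exits, so data is complete by then
        with ThreadPoolExecutor() as executor:
            executor.map(readurl, map(str.strip, infile.readlines()))
    print(data)

if __name__ == '__main__':
    main()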
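
A variation on the code above, offered as a sketch rather than a drop-in replacement: instead of writing to a module-level dictionary and silently discarding failures, each worker returns its result and failures are collected separately. The names fetch_all, attempt and fetch_with_requests are hypothetical, and max_workers=8 and the 10-second timeout are assumed values.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for each URL in a thread pool.

    Returns two dicts: {url: html} for successes and {url: exception}
    for failures.
    """
    def attempt(url):
        # Catch errors inside the worker so one bad URL does not stop the rest
        try:
            return url, fetch(url), None
        except Exception as exc:
            return url, None, exc

    pages, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the input URLs
        for url, html, exc in executor.map(attempt, urls):
            if exc is None:
                pages[url] = html
            else:
                errors[url] = exc
    return pages, errors


def fetch_with_requests(url):
    """Download one page with requests (assumed to be installed)."""
    import requests  # third-party; imported here so fetch_all has no hard dependency
    r = requests.get(url, timeout=10)  # assumed timeout, in seconds
    r.raise_for_status()
    return r.text
```

Calling fetch_all(urls, fetch_with_requests) returns the pages and the errors as two dictionaries. Passing the download function as a parameter also makes it possible to try the threading logic with a stand-in function that needs no network access.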
