
I have a .txt file that contains the complete URLs to a number of pages that each contain a table I want to scrape data off of. My code works for one URL, but when I try to add a loop and read in the URLs from the .txt file I get the following error

raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: ?

Here's my code

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

with open('urls.txt', 'r') as f:
    urls = f.read()
for url in urls:

    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    page_soup = soup(page_html, "html.parser")

    containers = page_soup.findAll("tr", {"class":"data"})


    for container in containers:
        unform_name = container.findAll("th", {"width":"30%"})
        name = unform_name[0].text.strip()

        unform_delegate = container.findAll("td", {"id":"y000"})
        delegate = unform_delegate[0].text.strip()

        print(name)
        print(delegate)

f.close()

I've checked my .txt file and all the entries look normal. They start with http: and end with .html. There are no apostrophes or quotes around them. Am I coding the for loop incorrectly?

Using

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)

I get the following

??http://www.thegreenpapers.com/PCC/AL-D.html

http://www.thegreenpapers.com/PCC/AL-R.html

http://www.thegreenpapers.com/PCC/AK-D.html

And so forth on 100 lines. Only the first line has question marks. My .txt file contains those URLs with only the state and party abbreviation changing.

Keyur Potdar
Kristen Funk

2 Answers


You can't read the whole file into a single string with `f.read()` and then iterate over that string: iterating over a string yields individual characters, not lines. See the change below. I also removed your last line; the `with` statement closes the file automatically when the block finishes, so `f.close()` is unnecessary.
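To see the difference, note that iterating over a string yields one character at a time, while iterating over a file object yields whole lines. A quick sketch, using `io.StringIO` to stand in for the open `urls.txt` file:

```python
from io import StringIO

s = "http://example.com\nhttp://example.org\n"

# Iterating over the string yields single characters...
print(list(s)[:4])  # ['h', 't', 't', 'p']

# ...while iterating over a file-like object yields whole lines.
f = StringIO(s)
print([line.strip() for line in f])  # ['http://example.com', 'http://example.org']
```

So with `f.read()`, each `url` in the loop is a single character like `'h'`, which is why `urlopen` complains about an unknown url type.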

The `whatisthis` function below (code from Greg Hewgill, for Python 2) shows whether each url string is of type `str` or `unicode`.

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

# Code from Greg Hewgill
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
        whatisthis(url)
        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()

        page_soup = soup(page_html, "html.parser")

        containers = page_soup.findAll("tr", {"class":"data"})

        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()

            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()

            print(name)
            print(delegate)

Running the code against a text file containing the URLs listed above produces this output:

http://www.thegreenpapers.com/PCC/AL-D.html

ordinary string
Gore, Al
54.   84%
Uncommitted
10.   16%
LaRouche, Lyndon

http://www.thegreenpapers.com/PCC/AL-R.html

ordinary string
Bush, George W.
44.  100%
Keyes, Alan

Uncommitted

http://www.thegreenpapers.com/PCC/AK-D.html
ordinary string
Gore, Al
13.   68%
Uncommitted
6.   32%
Bradley, Bill
Brian O'Donnell

The way you have tried can be fixed by tweaking two lines in your code.

Try this:

with open('urls.txt', 'r') as f:
    urls = f.readlines()  # make sure this line is properly indented
for url in urls:
    uClient = urlopen(url.strip())
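Here `readlines()` gives a list of lines and `.strip()` removes the trailing newline on each one. The `??` on the first line of the question's output is most likely a UTF-8 byte-order mark saved at the start of urls.txt; opening the file with the `utf-8-sig` codec drops it transparently. A minimal sketch (it writes a sample file first so it is self-contained):

```python
import codecs

# Write a sample urls.txt with a leading UTF-8 BOM, mimicking the
# question's file (the '??' prefix on the first line).
with open('urls.txt', 'wb') as out:
    out.write(codecs.BOM_UTF8 + b"http://www.thegreenpapers.com/PCC/AL-D.html\n")

# 'utf-8-sig' strips the BOM on read; strip() removes the trailing newline.
with codecs.open('urls.txt', 'r', encoding='utf-8-sig') as f:
    urls = [line.strip() for line in f]

print(urls[0])  # clean URL, no '??' prefix and no trailing newline
```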
SIM