BS4 and urllib: gather some links and store them in an array

Question

I am trying to gather data on the cost-of-living index for towns in Texas, USA, from the following website: http://www.city-data.com/city/Texas.html

Approach: to repeatedly extract links from the target page, I use the function below:

from bs4 import BeautifulSoup
import requests
import re

def getLinks(url):
    # Fetch the page that was passed in, rather than a hard-coded URL
    r = requests.get(url)
    # Name the parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(r.content, "html.parser")
    links = []

    # Collect the href of every <a> tag whose href starts with http://
    for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    return links

print( getLinks("http://www.city-data.com/city/Texas.html") )

Dataset: http://www.city-data.com/city/Texas.html links to the following pages, which hold information about each town and its number of inhabitants:

Abilene, TX 120,958 
Abram-Perezville 6,663 
Addison, TX 15,457 
Alamo Heights 7,806 
Alamo, TX 19,224 
Aldine 15,869 
Alice, TX 19,395 
Allen, TX 94,179 
Alton North 6,182

Note: the aim is to gather the data from the sub-pages, so I need a parser that loops through the sub-pages, e.g. the following:

http://www.city-data.com/city/Abilene-Texas.html
http://www.city-data.com/city/Abram-Perezville-Texas.html
http://www.city-data.com/city/Addison-Texas.html
http://www.city-data.com/city/Alamo-Heights-Texas.html

and so forth. But at the moment I get back:

ModuleNotFoundError: No module named 'BeautifulSoup'

PS: In my first attempt I used urllib2, but that is Python 2 only, so I changed it to urllib3. I am not sure whether that is correct, or whether this module is available in my Anaconda installation. This is pretty important. By the way, what about urllib2.urlopen? It seems to be outdated too, so I need to rewrite that as well. What do you think? At the moment I am a bit confused about urllib.urlopen. Looking forward to hearing from you!

Update: thanks to the hints from Andrej and Guilherme, I saw that I have the following setup in the packages:

So I need to change the modules that I import. Many thanks for the hint!
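
For the sub-page loop described above, one possible sketch is to turn the collected href values into absolute sub-page URLs before requesting them. It assumes that the city pages all end in `-Texas.html`, as in the examples in the question, and that some links on the index page may be relative. `city_page_urls` and the sample hrefs are hypothetical names and values used only for illustration.

```python
from urllib.parse import urljoin

INDEX_URL = "http://www.city-data.com/city/Texas.html"


def city_page_urls(hrefs, base_url=INDEX_URL, suffix="-Texas.html"):
    """Return absolute, de-duplicated URLs of city sub-pages, in input order."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href:
            continue  # <a> tags without an href yield None
        absolute = urljoin(base_url, href)  # resolves relative links
        if absolute.endswith(suffix) and absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


# Hypothetical href values, as link.get('href') might return them
sample = ["Abilene-Texas.html", "/city/Addison-Texas.html", None,
          "http://www.city-data.com/city/Abilene-Texas.html", "/forum/"]
print(city_page_urls(sample))
```

Each returned URL could then be fetched with `requests.get` and parsed with BeautifulSoup inside a loop.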

@Andrej Kesely Hello Andrej, many thanks for the hint. By the way, what about urllib2.urlopen? That seems to be outdated too, so I need to rewrite it as well. What do you think? Looking forward to hearing from you! – zero, May 18 '21 at 19:35
Hint: do not use `urllib2`. A much, much better library with a good API is `requests`. – Andrej Kesely, May 18 '21 at 19:37
Many thanks for the hint: I will try to rewrite this part of the code and have a closer look at the man pages. Thank you, Andrej, you're just great! – zero, May 18 '21 at 19:42
Hi Andrej, I am glad to see your awesome work. I have some lines of code that I am struggling with, see https://stackoverflow.com/questions/76362600/getting-data-out-of-clutch-co-with-bs4-and-requests-failed and also https://stackoverflow.com/questions/76409097/driver-webdriver-chrome-issues-with-a-selenium-approach-how-to-work-aro Can you help here? Thanks in advance, and keep up the great work! – malaga, Jun 07 '23 at 11:14

score 1 · Accepted Answer · answered May 18 '21 at 18:19

Change your import to:

from bs4 import BeautifulSoup

Run the pip list command in your terminal and make sure that the beautifulsoup4 library is installed.

Example:

C:\Users\xxxx>pip list
Package                Version
---------------------- ----------
beautifulsoup4         4.8.2
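
The same check can also be done from Python itself, which confirms that the interpreter you actually run (for example the Anaconda one) can see the package. This is only a minimal sketch, and `is_importable` is a hypothetical helper name. Note that the package is installed as `beautifulsoup4` but imported as `bs4`.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the current interpreter can find the top-level module."""
    return importlib.util.find_spec(module_name) is not None


# The pip package is called beautifulsoup4; the module you import is bs4
for name in ("bs4", "requests"):
    status = "found" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

If `bs4` is reported as missing here even though `pip list` shows it, pip and the interpreter probably belong to different environments.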

Guilherme Aguiar

Hello Guilherme, many thanks. I have put the output of the pip command in the update. Later this evening I will re-run the code; I have to hurry to catch the train now. Many thanks for all you did! – zero May 18 '21 at 18:47
Again, thanks for the hint. By the way, what about urllib2.urlopen? That seems to be outdated too, so I need to rewrite it as well. What do you think? Looking forward to hearing from you! – zero May 18 '21 at 19:36
Hello @zero! Sorry for the delay in replying. As Andrej replied, use the `requests` library instead of urllib2. I believe you have already managed to resolve it. – Guilherme Aguiar Jun 23 '21 at 02:17
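
On the `urllib2.urlopen` question: in Python 3 that function lives in the standard library as `urllib.request.urlopen`, so a direct port could look like the sketch below. `fetch_html` is a hypothetical helper name and the 10-second timeout is an arbitrary choice. With `requests`, as recommended in the comments, the equivalent call would be `requests.get(url, timeout=10).text`.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its body decoded to text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset from the response headers, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can be handed to `BeautifulSoup(html, "html.parser")` in the same way as `r.content` in the question.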
