
I've got a little script for gathering data from a list of websites. I've been using lynx for this, but after going through the data, I noticed some sites were not returning any results.

#!/bin/bash

# Require a search term as the first argument.
[ "$1" ] || exit 1

# Temp file holding one search URL per line.
tmp=$(mktemp "${1}_XXXXXXXXX")

cat <<EOF > "$tmp"
https://google.com/search?q=${1}
https://duckduckgo.com/?q=${1}
https://www.bing.com/search?q=${1}
EOF

# Dump the links from each search page, keep those matching the
# search term, and drop duplicate lines.
while read -r url; do

    lynx -nonumbers -dump -hiddenlinks=merge -listonly "$url" | \
    grep -i "${1}" | awk '!x[$0]++' >> file.txt

done < "$tmp"

rm "$tmp"

It turns out it's a certificate validation problem, and apparently lynx doesn't have a flag to ignore validation. While I understand that validation is in everyone's best interest, I need to be able to pull data from every website in my list.

So I looked into using Python and BeautifulSoup instead. From this answer I'm able to pull the links from a single URL, and from this answer, to ignore validation.

Using Python 3.6, this is what I have so far:

from bs4 import BeautifulSoup
import urllib.request
import ssl

# SSL context that skips certificate validation.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

resp = urllib.request.urlopen('https://google.com', context=ctx)
soup = BeautifulSoup(resp, "lxml")

# Print the target of every link on the page.
for link in soup.find_all('a', href=True):
    print(link['href'])

I'd like to pass the same list that's in the bash script to the Python script to pull the links from each of the URLs in the list. So essentially, each line of this list

https://google.com/search?q=${1}
https://duckduckgo.com/?q=${1}
https://www.bing.com/search?q=${1}

would get passed, one at a time, as the URL in resp = urllib.request.urlopen(url, context=ctx)

How do I do this?

I0_ol
  • What exactly do you want to do? Where do you want to pass the list of sites? – user3764893 Jun 09 '17 at 08:31
  • I'm trying to get the same functionality as the bash script. I don't know Python at all. Most of what I know I learned today. I'm trying to get the Python script to read the same list as the bash script. – I0_ol Jun 09 '17 at 08:50
  • You can have a loop which iterates through your list, opens each link, and parses it. Wouldn't that work? – user3764893 Jun 09 '17 at 08:54
  • Yeah, I just don't know how to do it in Python. – I0_ol Jun 09 '17 at 08:59

2 Answers


Try Python string formatting.

'https://google.com/search?q=%s' % ('text',) yields 'https://google.com/search?q=text', if that's what you're looking for.
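
Applied to the URLs in the question, a minimal sketch might look like this (assuming the search term arrives as sys.argv[1], mirroring ${1} in the bash script):

import sys

# Assumption: the search term is passed on the command line, like "$1" in bash.
term = sys.argv[1]

templates = [
    'https://google.com/search?q=%s',
    'https://duckduckgo.com/?q=%s',
    'https://www.bing.com/search?q=%s',
]

# Substitute the term into each template to build the list of URLs.
urls = [t % (term,) for t in templates]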

codelessbugging

Read the site names, let's say from a list, iterate through them, send a request, and parse the response.

from bs4 import BeautifulSoup
import urllib.request

site_list = ['http://example.com', 'https://google.com']

for site in site_list:
    # Fetch each page and extract every link from it.
    resp = urllib.request.urlopen(site)
    soup = BeautifulSoup(resp, "lxml")

    for link in soup.find_all('a', href=True):
        print(link['href'])
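
To match the bash script end to end, you can combine this loop with the ssl context from the question and read the URLs from a file instead of hardcoding them. A sketch, assuming the URL list lives in a file named on the command line, one URL per line (the filename argument is an assumption, not part of the original script):

import ssl
import sys
import urllib.request

from bs4 import BeautifulSoup

# Same context as in the question: skip certificate validation.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Assumption: the URL list is in a file passed as the first argument,
# one URL per line, like the temp file in the bash script.
with open(sys.argv[1]) as f:
    site_list = [line.strip() for line in f if line.strip()]

for site in site_list:
    resp = urllib.request.urlopen(site, context=ctx)
    soup = BeautifulSoup(resp, "lxml")
    for link in soup.find_all('a', href=True):
        print(link['href'])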
user3764893