
I've got a little script for gathering data from a list of websites. I've been using lynx for this, but after going through the data, I noticed some sites were not returning any results.

#!/bin/bash

# Require a search term as the first argument.
[ "$1" ] || exit 1

# Temp file holding one search URL per line.
tmp=$(mktemp "${1}_XXXXXXXXX")

cat <<EOF > "$tmp"
https://google.com/search?q=${1}
https://duckduckgo.com/?q=${1}
https://www.bing.com/search?q=${1}
EOF

# Dump the links from each search page, keep those matching the
# search term, and drop duplicate lines.
while read -r url; do

    lynx -nonumbers -dump -hiddenlinks=merge -listonly "$url" | \
    grep -i "${1}" | awk '!x[$0]++' >> file.txt

done < "$tmp"

rm "$tmp"

It turns out it's a certificate validation problem, and apparently lynx doesn't have a flag to ignore validation. While I understand that validation is in everyone's best interest, I need to be able to pull data from every website in my list.

So I looked into using Python and BeautifulSoup instead. From this answer I'm able to pull the links from a single URL, and from this answer, to ignore validation.

Using Python 3.6, this is what I have so far:

from bs4 import BeautifulSoup
import urllib.request
import ssl

# SSL context that skips certificate validation.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

resp = urllib.request.urlopen('https://google.com', context=ctx)
soup = BeautifulSoup(resp, "lxml")

# Print the target of every link on the page.
for link in soup.find_all('a', href=True):
    print(link['href'])

I'd like to pass the same list that's in the bash script to the Python script to pull the links from each of the URLs in the list. So essentially, each line of this list

https://google.com/search?q=${1}
https://duckduckgo.com/?q=${1}
https://www.bing.com/search?q=${1}

would get passed, one at a time, as the URL in resp = urllib.request.urlopen(url, context=ctx)

How do I do this?

I0_ol
  • What exactly do you want to do? Where do you want to pass the list of sites? – user3764893 Jun 09 '17 at 08:31
  • I'm trying to get the same functionality as the bash script. I don't know Python at all. Most of what I know I learned today. I'm trying to get the Python script to read the same list as the bash script. – I0_ol Jun 09 '17 at 08:50
  • You can have a loop which iterates through your list, opens each link, and parses it. Wouldn't that work? – user3764893 Jun 09 '17 at 08:54
  • Yeah, I just don't know how to do it in Python. – I0_ol Jun 09 '17 at 08:59

2 Answers


Try Python string formatting.

'https://google.com/search?q=%s' % ('text',) yields 'https://google.com/search?q=text', if that's what you're looking for.
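
Applied to the URLs in the question, a minimal sketch might look like this (assuming the search term arrives as sys.argv[1], mirroring ${1} in the bash script):

import sys

# Assumption: the search term is passed on the command line, like "$1" in bash.
term = sys.argv[1]

templates = [
    'https://google.com/search?q=%s',
    'https://duckduckgo.com/?q=%s',
    'https://www.bing.com/search?q=%s',
]

# Substitute the term into each template to build the list of URLs.
urls = [t % (term,) for t in templates]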

codelessbugging

Read the site names, let's say from a list, iterate through them, send a request, and parse the response.

from bs4 import BeautifulSoup
import urllib.request

site_list = ['http://example.com', 'https://google.com']

for site in site_list:
    # Fetch each page and extract every link from it.
    resp = urllib.request.urlopen(site)
    soup = BeautifulSoup(resp, "lxml")

    for link in soup.find_all('a', href=True):
        print(link['href'])
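
To match the bash script end to end, you can combine this loop with the ssl context from the question and read the URLs from a file instead of hardcoding them. A sketch, assuming the URL list lives in a file named on the command line, one URL per line (the filename argument is an assumption, not part of the original script):

import ssl
import sys
import urllib.request

from bs4 import BeautifulSoup

# Same context as in the question: skip certificate validation.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Assumption: the URL list is in a file passed as the first argument,
# one URL per line, like the temp file in the bash script.
with open(sys.argv[1]) as f:
    site_list = [line.strip() for line in f if line.strip()]

for site in site_list:
    resp = urllib.request.urlopen(site, context=ctx)
    soup = BeautifulSoup(resp, "lxml")
    for link in soup.find_all('a', href=True):
        print(link['href'])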
user3764893