I've got a little script for gathering data from a list of websites. I've been using lynx for this, but after going through the data I noticed that some sites weren't returning any results.
#!/bin/bash
# Require a search term as the first argument.
[ "$1" ] || exit 1

# Build a temporary file with one search URL per line.
tmp=$(mktemp "${1}_XXXXXXXXX")
cat <<EOF > "$tmp"
https://google.com/search?q=${1}
https://duckduckgo.com/?q=${1}
https://www.bing.com/search?q=${1}
EOF

# Dump the links from each URL, keep those mentioning the term, drop duplicates.
while read -r url; do
    lynx -nonumbers -dump -hiddenlinks=merge -listonly "$url" | \
        grep -i "${1}" | awk '!x[$0]++' >> file.txt
done < "$tmp"
rm "$tmp"
It turns out it's a certificate validation problem, and apparently lynx doesn't have a flag to ignore validation. While I understand that validation is in everyone's best interest, I still need to be able to pull data from every website in my list.
So I looked into using Python and BeautifulSoup instead. From this answer I'm able to pull the links from a single URL, and from this answer I can ignore validation.
Using Python 3.6, this is what I have so far:
from bs4 import BeautifulSoup
import urllib.request
import ssl

# SSL context that skips certificate validation.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

resp = urllib.request.urlopen('https://google.com', context=ctx)
soup = BeautifulSoup(resp, "lxml")

# Print the href of every link on the page.
for link in soup.find_all('a', href=True):
    print(link['href'])
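That part works fine for a single URL: it prints every href on the page.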
I'd like to pass the same list that's in the bash script to the Python script and pull the links from each of the URLs in the list. So essentially, each line of this list
https://google.com/search?q=${1}
https://duckduckgo.com/?q=${1}
https://www.bing.com/search?q=${1}
would get passed, one at a time, as the url in
resp = urllib.request.urlopen(url, context=ctx)
How do I do this?
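In case it helps clarify what I'm after, here's a rough sketch of how I imagine it fitting together. The hard-coded URL list mirrors the bash script, and the query handling and file.txt output are just my guesses, not tested code:

import sys
import ssl
import urllib.request
from urllib.parse import quote_plus

from bs4 import BeautifulSoup

# Take the search term the same way the bash script does ($1).
term = sys.argv[1]
query = quote_plus(term)

# The same list that the bash script writes to the temp file.
urls = [
    f"https://google.com/search?q={query}",
    f"https://duckduckgo.com/?q={query}",
    f"https://www.bing.com/search?q={query}",
]

# Context that skips certificate validation, as above.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

seen = set()
with open("file.txt", "a") as out:
    for url in urls:
        resp = urllib.request.urlopen(url, context=ctx)
        soup = BeautifulSoup(resp, "lxml")
        for link in soup.find_all("a", href=True):
            href = link["href"]
            # Roughly the grep -i "$1" | awk '!x[$0]++' part:
            # keep links mentioning the term, skip duplicates.
            if term.lower() in href.lower() and href not in seen:
                seen.add(href)
                print(href, file=out)

Would something along these lines work, or is there a better way to feed the list in (reading it from a file or stdin, say)?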