I am a seller on Target.com and am trying to scrape the URL for every product in my catalog using Python 3. When I run the code below I get an empty list for 'urllist', and when I print the variable 'soup', what BS4 has actually collected is the contents of "view page source" (forgive my naiveté here, I'm definitely still a novice at this!). What I'd really like to scrape is the content shown in the "Elements" tab of the DevTools panel. I can sift through the HTML in that tab manually and find the links, so I know they're in there; I just don't know enough yet to tell BS4 that's the content I want to search. How can I do that?
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Need this context below to deal with HTTPS
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL: ')
urllist = []

# Fetch the raw page and parse it
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Collect the href attribute of every anchor tag
for link in soup.find_all('a'):
    urllist.append(link.get('href'))

print(urllist)
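From what I've read, the gap may be that urllib only fetches the raw page source, while the "Elements" tab shows the DOM after the page's JavaScript has run, so the product links never make it into what BS4 receives. If that's the case, I'm guessing something like the sketch below is needed, where a real browser renders the page before parsing. This assumes Selenium and Chrome are installed; I haven't verified it against my actual catalog pages:

from selenium import webdriver
from bs4 import BeautifulSoup

url = input('Enter URL: ')

# Let a real browser load the page and execute its JavaScript first
driver = webdriver.Chrome()
driver.get(url)
# Note: a time.sleep() or explicit wait may be needed for late-loading content

# driver.page_source holds the rendered DOM, closer to the "Elements" tab
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

urllist = []
for link in soup.find_all('a'):
    urllist.append(link.get('href'))

print(urllist)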
If it helps, I found some JavaScript that someone wrote to be run from the developer console; it works and grabbed all of my links. But my goal is to be able to do this in Python 3:
var x = document.querySelectorAll("a");
var myarray = [];
for (var i = 0; i < x.length; i++) {
    var nametext = x[i].textContent;
    var cleantext = nametext.replace(/\s+/g, ' ').trim();
    var cleanlink = x[i].href;
    myarray.push([cleantext, cleanlink]);
}

function make_table() {
    var table = '<table><thead><th>Name</th><th>Links</th></thead><tbody>';
    for (var i = 0; i < myarray.length; i++) {
        table += '<tr><td>' + myarray[i][0] + '</td><td>' + myarray[i][1] + '</td></tr>';
    }
    var w = window.open("");
    w.document.write(table);
}

make_table()
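For what it's worth, my understanding is that the Python equivalent of that console loop would look something like the lines below, assuming 'soup' were built from the rendered DOM rather than the raw source (the variable names just mirror the JavaScript and are illustrative):

# Build (name, link) pairs the way the console snippet does
myarray = []
for a in soup.find_all('a'):
    # collapse runs of whitespace, like the replace(/\s+/g, ' ').trim() call
    cleantext = ' '.join(a.get_text().split())
    cleanlink = a.get('href')
    myarray.append([cleantext, cleanlink])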