
I'm trying to write a scraper that collects data from the product pages listed in an XML sitemap file. The program below works fine for a single hard-coded URL. I have downloaded the sitemap, which contains the URLs of all products. Is there a way to extract those URLs and loop over them so the whole process is automated?

The XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="xx" type="text/xsl"?>
<urlset xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" 
    xmlns:xhtml="http://www.w3.org/1999/xhtml" 
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>(URL IS HERE)</loc>
        <changefreq>daily</changefreq>
        <image:image>
            <image:loc>(URL OF PICTURE, not relevant)</image:loc>
        </image:image>
    </url>
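
As a rough sketch of the extraction step, the `<loc>` values could be pulled out with the standard library's `xml.etree.ElementTree`. The function name `extract_product_urls` and the `sm` prefix are made up for illustration; the namespace URI is copied from the `xmlns` attribute on `<urlset>` above. Because the sitemap declares a default namespace, the lookup has to be namespace-qualified, otherwise `findall` returns nothing.

```python
import xml.etree.ElementTree as ET

# Default sitemap namespace, copied from the xmlns attribute on <urlset>.
# The "sm" prefix is an arbitrary local alias, not something in the file.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_product_urls(xml_data):
    """Return the text of every <url>/<loc> element in a sitemap document.

    xml_data may be bytes or str. <image:loc> elements live in a different
    namespace, so they are not matched by this query.
    """
    root = ET.fromstring(xml_data)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls
```

Reading the downloaded file in binary mode and passing the bytes to this function should give a plain list of product URLs to iterate over.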

The code looks like this:

from bs4 import BeautifulSoup
import requests

filename = "products.csv"

print('step 1')
# URL of the product page
page_link = "<privacy>"
print('step 2')
# Fetch the page; raise_for_status() stops on HTTP errors such as 404
page_response = requests.get(page_link, timeout=1)
page_response.raise_for_status()
print('step 3')
# Parse the HTML
page_content = BeautifulSoup(page_response.content, "html.parser")
print('step 4')
# Product price and name: first element carrying each class
price = page_content.find_all(class_='<privacy>')[0].decode_contents()
naam = page_content.find_all(class_='product-name')[0].decode_contents()
print('step 5')
# Print the result ("kost nu" means "now costs")
print("Product:", naam, "kost nu", price)

# Write the header ("Naam, prijs" = "Name, price") and one data row.
# The with block closes the file even if a write fails.
with open(filename, "w") as f:
    f.write("Naam,prijs\n")
    f.write(naam + "," + price.replace(",", "|") + "\n")
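
For the "for each" part, one possible shape is sketched below. It assumes the single-page code above has been wrapped in a function, here called `scrape_one(url)`, that returns a `(naam, price)` tuple or raises on failure. Both `scrape_one` and `scrape_all` are hypothetical names, not something that exists in the program yet. The sketch uses the standard `csv` module, which quotes fields containing commas, so replacing `,` with `|` in the price would no longer be needed.

```python
import csv


def scrape_all(urls, scrape_one, out_path="products.csv"):
    """Call scrape_one(url) for each URL and write one CSV row per success.

    scrape_one is a hypothetical wrapper around the single-page code;
    it is assumed to return a (naam, price) tuple or raise on failure.
    Returns a list of (url, exception) pairs for the pages that failed.
    """
    failed = []
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Naam", "prijs"])
        for url in urls:
            try:
                naam, price = scrape_one(url)
            except Exception as exc:
                # One broken page should not abort the whole run.
                failed.append((url, exc))
                continue
            writer.writerow([naam.strip(), price.strip()])
    return failed
```

Passing the scraping function in as an argument keeps the loop independent of `requests` and BeautifulSoup, so it can be tried out with a stub before hitting the real site.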
  • Try to be a little clearer. Do you want to run the code you have for all the urls in the XML file? – teller.py3 Sep 26 '18 at 22:55
  • Questions are encouraged to provide an [MCVE](/help/mcve). – LMC Sep 26 '18 at 23:31
  • Possible duplicate of [parsing xml containing default namespace to get an element value using lxml](https://stackoverflow.com/questions/31177707/parsing-xml-containing-default-namespace-to-get-an-element-value-using-lxml) – LMC Sep 27 '18 at 00:17
