
I'm trying to read a gzipped XML sitemap into pandas. Requests should be able to handle gzip automatically, and gzip is detected in the headers, but with the gzipped sitemaps it's not working: I get "not well-formed (invalid token): line 1, column 0", even though the sitemap looks fine to me.

import requests
import pandas as pd
import xmltodict
import numpy as np

url = "https://www.blick.ch/article.xml"
res = requests.get(url)
raw = xmltodict.parse(res.text)

dfAllLocs = pd.DataFrame({'loc': []})

for r in raw["sitemapindex"]["sitemap"]:
    #try: 
        print(r["loc"])
        resSingle = requests.get(r["loc"])
        #print(resSingle.headers)

        rawSingle = xmltodict.parse(resSingle.text, encoding='utf-8')
        dataSingle = [[rSingle["loc"]] for rSingle in rawSingle["urlset"]["url"]]
        dfSingle = pd.DataFrame(dataSingle, columns=["loc"])
        dfAllLocs = pd.concat([dfAllLocs,dfSingle])
        print(len(dfAllLocs))
    #except:
    #    print("something went wrong at: " + r["loc"])
Tobi
  • I tried to request the sitemap data and it seems to retrieve it just fine (`res.text` contains XML data). My guess is that the issue lies with `xmltodict.parse` rather than requests. – Ionut Ticus Apr 12 '20 at 12:20
  • 1
    Just realized your issue was further down the code; the issue is that requests will only handle *transport-level compression* automatically but in your case the content is compressed; more details [here](https://stackoverflow.com/a/32463456). – Ionut Ticus Apr 12 '20 at 12:30
  • maybe the example sitemap isn't the best, because the first sitemap isn't gzipped in this example; https://www.blick.ch/article.xml is better – Tobi Apr 12 '20 at 12:34
  • 1
    The issue is caused by confusing `Content-Encoding` (handled automatically by requests) and `Content-Type` (needs to be handled by you). For something like `https://www.blick.ch/article-2001-01.xml.gz` you'll need to handle decompression using something like [Gzip](https://docs.python.org/3/library/gzip.html) module. – Ionut Ticus Apr 12 '20 at 12:39
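The distinction the comments describe can be demonstrated offline (the sitemap bytes below are synthetic): requests only undoes transport-level compression (`Content-Encoding`), so a gzip *file* (`Content-Type: application/x-gzip`) arrives still compressed, and the XML parser chokes on the gzip magic bytes at line 1, column 0.

```python
import gzip

# Synthetic stand-in for what a .xml.gz sub-sitemap returns: requests
# hands back these bytes unchanged, because only Content-Encoding
# (transport compression) is decoded automatically, not Content-Type.
xml = b'<urlset><url><loc>https://example.com/a</loc></url></urlset>'
body = gzip.compress(xml)

print(body[:2])                      # gzip magic bytes, not XML: b'\x1f\x8b'
print(gzip.decompress(body) == xml)  # True after manual decompression
```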

1 Answer


Thanks Ionut Ticus. This link was super useful: "Having Trouble Getting requests==2.7.0 to Automatically Decompress gzip".

It works now:

import gzip
import requests
import xmltodict
import pandas as pd

#Get Sitemap
url = 'https://www.watson.ch/sitemap.xml'
pattern = r'(.*?)/'  # unused in this snippet
maxSubsitemapsToCrawl = 10

res = requests.get(url)
raw = xmltodict.parse(res.text)

dfSitemap = pd.DataFrame({'loc': []})

breakcounter = 0
for r in raw["sitemapindex"]["sitemap"]:
    try: 
        print(r["loc"])
        resSingle = requests.get(r["loc"], stream=True)
        if resSingle.status_code == 200:
            if resSingle.headers['Content-Type'] == 'application/x-gzip':
                resSingle.raw.decode_content = True
                resSingle = gzip.GzipFile(fileobj=resSingle.raw)
            else: 
                resSingle = resSingle.text

            rawSingle = xmltodict.parse(resSingle)
            dataSingle = [[rSingle["loc"]] for rSingle in rawSingle["urlset"]["url"]]
            dfSingle = pd.DataFrame(dataSingle, columns=["loc"])
            dfSitemap = pd.concat([dfSitemap,dfSingle])
            print(len(dfSitemap))
    except Exception:
        print("something went wrong at: " + r["loc"])

    breakcounter += 1
    if breakcounter == maxSubsitemapsToCrawl:
        break
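If you don't need streaming, an alternative to going through `resSingle.raw` is to decompress `resSingle.content` directly. A minimal sketch; the helper name `sitemap_xml` is mine, not from the code above:

```python
import gzip

def sitemap_xml(body, content_type):
    """Return sitemap XML text from a raw response body, decompressing
    when the server labels it a gzip file (illustrative helper)."""
    if content_type == 'application/x-gzip':
        return gzip.decompress(body).decode('utf-8')
    return body.decode('utf-8')

# Usage with requests (sketch):
#   res = requests.get(subsitemap_url)
#   xml = sitemap_xml(res.content, res.headers.get('Content-Type', ''))
#   raw = xmltodict.parse(xml)
```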