Python - unable to convert the results of a scraped XML file (from sec.gov) to raw bytes

Question

I am trying to scrape an XML file from sec.gov and convert it to one long string, but the request just returns a byte string of what looks like a bunch of addresses. I don't know how to get the result back as a string, or as an object that I can convert to a string.

For example, I just want a string form of this:

<?xml version="1.0" encoding="ISO-8859-1" ?>
  <feed xmlns="http://www.w3.org/2005/Atom">
    <author>
      <email>webmaster@sec.gov</email>
      <name>Webmaster</name>
    </author>
    <company-info>
      <addresses>
        <address type="mailing">
          <city>CHATHAM</city>
          <state>NJ</state>
          <street1>26 MAIN STREET, SUITE 101</street1>
          <zip>07928</zip>
        </address>
        <address type="business">
          <city>CHATHAM</city>

Here is my code:

#!/usr/bin/python3

from lxml import html
from lxml import etree
import requests
from time import sleep
import json
import argparse
from random import randint
import sys
from urllib.parse import urlencode
from urllib.request import Request, urlopen
from pprint import pprint
import traceback
from xml.etree import ElementTree


def parse_finance_page(urlAddress):

  headers = {
          "Accept-Encoding":"gzip, deflate",
          "Accept-Language":"en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7",
          "Connection":"keep-alive",
          "Cache-Control":"no-store, no-cache, must-revalidate, max-age=0'",
          "Cache-Control":"post-check=0, pre-check=0",
          "Pragma":"no-cache",
          "Host":"www.sec.gov",
          "Referer":"https://www.sec.gov",
          "Upgrade-Insecure-Requests":"1",
          "User-Agent":"name@address.com"
    }

  for retries in range(5):
    try:

      request = Request("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001430306&type=&dateb=&owner=include&start=0&count=40&output=atom", headers=headers)
      html = urlopen(request).read()

      print(html)
  
      xmlString = ElementTree.tostring(html, encoding='unicode')
      print(xmlString)

      quit()

The first print just prints out what looks like a byte string of a bunch of memory addresses, for example:

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9dko\xdb8\x16\x86\xbf\xef\xaf \xf2a0\x83\xadc\x92\xba\xab\xb1g\xbdi\x8af\xda\xa4\x9d&3\xbb\x8b\xc5b\xa0\xd8t"\xd4\x96\x0cIn\x92\xf9\xf5\xcbC\xc9\x97X\x8adMMZq\x05\x14\xa9M\xd32ux\xce\xfbH\xbc\x1c\x9d\xfc\xfc0\x9d\xa0\xaf,\x8a\xfd0\xe8\x1d\x91c|\x84X0\x0cG~p\xdb;:\xbf\xfa\xd8\xb1m\xc3\xe9\x90#\xf4s\xffo\x08\x9d\x8c\x19\x1b!\xfe\x95 \xee\x1d\xdd%\xc9\xcc\xedv\xef\xef\xef\x8f\xef\xb5\xe30\xba\xedR\x8c\x8d\xee \t\xa7GP\x99W\xf7\xe6\xc9]\x18\xa5o\xf8[6\xf5\xfcI\xff\x9e\xddL\xbd8a\xd1?b6<\xbe\r\xbf\x9et\xd3\x0f\x16\xd5\x02o\xca\xfa\xffZ\xd4:\xe9\x8a\xf7\xe9\x01\xbb\xebG<\x19\x86\xd3\x99\x17<v\xfc\x1c.\xbf\xed\x8dF\x11\x8bc\x16/JVe(y\x9c\xb1\xde\x11\xfc\x18?\xbf\xa3U\x058\x96\x9f<\xf6O\xdf\r\xae\xdf\r.N\xba\xe2\xdd\xfa\xc7q\xe2%\xac\x7f\xf9\xcbI7}\xf5\xf4\xb3\x88\xb1\x84\xf4\xa9\x89.\x06\xe7\x97\xe8\xea\xfa\xf3\xd9\xd9\xf5+t\xf5\xdb\xf9\xf5\x19"\x98\xc0\x97\xd2*\xeb_\xfb\xd3\x9f\xf5\xb1\xe5P\xfb\xa4\x0b/W\xad\xedf\xcd}\xf6\x04n\xe6\xb1\x1f\xf0\xb7\xb5\xcev\x17\x06\xacO\t\xed86\xee8\xc40N\xbaiYS\xcesY\xb0\xea\xbb\x13/\x8e\xfd\xdb\x80\x8d:\xb1?\xecS[\xd3y\xa5\xf5\xa2\xa2z\x9d\x11\x8b\x87\xfdO\xef\x06\x9f/\x06\xa7g\xbf]\x9f\x9f\x0e>\xa0O\x9f\xcf>\r>\x0f\xae\xcf?

And it goes on and on and on...

The lines:

xmlString = ElementTree.tostring(html, encoding='unicode')
print(xmlString)

Just produce:

Failed to process the request, Exception:'bytes' object has no attribute 'iter'

Any help is greatly appreciated.

This doesn't answer your question, but I would recommend you just use `requests` and `beautifulsoup` to do what you want to do. See https://realpython.com/beautiful-soup-web-scraper-python/#step-3-parse-html-code-with-beautiful-soup — GordonAitchJay, Jun 14 '22 at 05:07
Hi mzin, is there a way to use fromstring() that involves headers? — Brent Heigold, Jun 14 '22 at 05:16
You provided an `Accept-Encoding: gzip, deflate` header which resulted in the gzip'd response. Either don't provide that header or use gzip to decode that response back into its original form, then you can pass the result into the XML parser. — metatoaster, Jun 14 '22 at 05:18
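
A minimal sketch of what the last comment describes, in case it helps. The printed output in the question starts with `\x1f\x8b`, which is the gzip magic number, so the body is compressed XML rather than memory addresses. `ElementTree.tostring` also expects a parsed `Element`, not `bytes`, which is where the `'bytes' object has no attribute 'iter'` error comes from. The helper names `body_to_text` and `body_to_element` are hypothetical, and the `ISO-8859-1` default is an assumption taken from the XML declaration shown in the question.

```python
import gzip
from xml.etree import ElementTree

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def _gunzip_if_needed(raw: bytes) -> bytes:
    """Decompress the body only when it starts with the gzip magic number."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


def body_to_text(raw: bytes, encoding: str = "ISO-8859-1") -> str:
    """Return the response body as one string (hypothetical helper).

    The default encoding is an assumption based on the XML declaration
    in the question; pass a different one if the feed declares it.
    """
    return _gunzip_if_needed(raw).decode(encoding)


def body_to_element(raw: bytes) -> ElementTree.Element:
    """Parse the response body into an Element (hypothetical helper).

    fromstring() accepts bytes and honours the encoding named in the
    XML declaration, so no manual decode is needed here.
    """
    return ElementTree.fromstring(_gunzip_if_needed(raw))


# Offline demonstration with a made-up gzip'd payload instead of a live request.
sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    b'<feed xmlns="http://www.w3.org/2005/Atom">'
    b"<author><name>Webmaster</name></author>"
    b"</feed>"
)
compressed = gzip.compress(sample)

xml_string = body_to_text(compressed)   # plain str containing the XML
root = body_to_element(compressed)      # Element, safe to pass to tostring()
```

In the code from the question this would replace the `ElementTree.tostring(html, ...)` call with something like `xmlString = body_to_text(urlopen(request).read())`. The other option from the comment is to drop the `Accept-Encoding` entry from `headers`, so that the client does not ask for a compressed response in the first place.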
