I am trying to scrape an XML file from sec.gov and convert it to one long string. Instead, the request returns a bytes object that looks like a bunch of memory addresses, and I don't know how to get it back as a string, or as an object that I can convert to a string.
For example, I just want the string form of this:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<feed xmlns="http://www.w3.org/2005/Atom">
<author>
<email>webmaster@sec.gov</email>
<name>Webmaster</name>
</author>
<company-info>
<addresses>
<address type="mailing">
<city>CHATHAM</city>
<state>NJ</state>
<street1>26 MAIN STREET, SUITE 101</street1>
<zip>07928</zip>
</address>
<address type="business">
<city>CHATHAM</city>
Here is my code:
#!/usr/bin/python3
from lxml import html
from lxml import etree
import requests
from time import sleep
import json
import argparse
from random import randint
import sys
from urllib.parse import urlencode
from urllib.request import Request, urlopen
from pprint import pprint
import traceback
from xml.etree import ElementTree

def parse_finance_page(urlAddress):
    headers = {
        "Accept-Encoding":"gzip, deflate",
        "Accept-Language":"en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7",
        "Connection":"keep-alive",
        "Cache-Control":"no-store, no-cache, must-revalidate, max-age=0'",
        "Cache-Control":"post-check=0, pre-check=0",
        "Pragma":"no-cache",
        "Host":"www.sec.gov",
        "Referer":"https://www.sec.gov",
        "Upgrade-Insecure-Requests":"1",
        "User-Agent":"name@address.com"
    }
    for retries in range(5):
        try:
            request = Request("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001430306&type=&dateb=&owner=include&start=0&count=40&output=atom", headers=headers)
            html = urlopen(request).read()
            print(html)
            xmlString = ElementTree.tostring(html, encoding='unicode')
            print(xmlString)
            quit()
The first print outputs what looks like a bytes object full of memory addresses, for example:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9dko\xdb8\x16\x86\xbf\xef\xaf \xf2a0\x83\xadc\x92\xba\xab\xb1g\xbdi\x8af\xda\xa4\x9d&3\xbb\x8b\xc5b\xa0\xd8t"\xd4\x96\x0cIn\x92\xf9\xf5\xcbC\xc9\x97X\x8adMMZq\x05\x14\xa9M\xd32ux\xce\xfbH\xbc\x1c\x9d\xfc\xfc0\x9d\xa0\xaf,\x8a\xfd0\xe8\x1d\x91c|\x84X0\x0cG~p\xdb;:\xbf\xfa\xd8\xb1m\xc3\xe9\x90#\xf4s\xffo\x08\x9d\x8c\x19\x1b!\xfe\x95 \xee\x1d\xdd%\xc9\xcc\xedv\xef\xef\xef\x8f\xef\xb5\xe30\xba\xedR\x8c\x8d\xee \t\xa7GP\x99W\xf7\xe6\xc9]\x18\xa5o\xf8[6\xf5\xfcI\xff\x9e\xddL\xbd8a\xd1?b6<\xbe\r\xbf\x9et\xd3\x0f\x16\xd5\x02o\xca\xfa\xffZ\xd4:\xe9\x8a\xf7\xe9\x01\xbb\xebG<\x19\x86\xd3\x99\x17<v\xfc\x1c.\xbf\xed\x8dF\x11\x8bc\x16/JVe(y\x9c\xb1\xde\x11\xfc\x18?\xbf\xa3U\x058\x96\x9f<\xf6O\xdf\r\xae\xdf\r.N\xba\xe2\xdd\xfa\xc7q\xe2%\xac\x7f\xf9\xcbI7}\xf5\xf4\xb3\x88\xb1\x84\xf4\xa9\x89.\x06\xe7\x97\xe8\xea\xfa\xf3\xd9\xd9\xf5+t\xf5\xdb\xf9\xf5\x19"\x98\xc0\x97\xd2*\xeb_\xfb\xd3\x9f\xf5\xb1\xe5P\xfb\xa4\x0b/W\xad\xedf\xcd}\xf6\x04n\xe6\xb1\x1f\xf0\xb7\xb5\xce
v\x17\x06\xacO\t\xed86\xee8\xc40N\xbaiYS\xcesY\xb0\xea\xbb\x13/\x8e\xfd\xdb\x80\x8d:\xb1?\xecS[\xd3y\xa5\xf5\xa2\xa2z\x9d\x11\x8b\x87\xfdO\xef\x06\x9f/\x06\xa7g\xbf]\x9f\x9f\x0e>\xa0O\x9f\xcf>\r>\x0f\xae\xcf?
And it goes on and on and on...
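One thing I noticed is that the output starts with \x1f\x8b, which as far as I know is the gzip magic number, so the body may be compressed because my headers ask for "gzip, deflate". Below is a minimal sketch of how I think it could be decoded, assuming that guess is right. The helper name decode_body is hypothetical, the encoding is taken from the XML declaration above, and the sample data is an offline stand-in for the real response rather than what sec.gov returns.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def decode_body(raw: bytes, encoding: str = "ISO-8859-1") -> str:
    """Return the body as text, decompressing it first if it is gzip data.

    Hypothetical helper: the default encoding is an assumption based on the
    XML declaration in the feed I am trying to read.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Offline stand-in for urlopen(request).read(): compress a small XML snippet.
sample = '<?xml version="1.0" encoding="ISO-8859-1" ?><feed><city>CHATHAM</city></feed>'
compressed = gzip.compress(sample.encode("ISO-8859-1"))

print(compressed[:2])           # the gzip magic bytes
print(decode_body(compressed))  # the readable XML text
```

If this is the cause, dropping the "Accept-Encoding" header might also avoid the compression, but I have not confirmed that against the live site.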
The lines:
xmlString = ElementTree.tostring(html, encoding='unicode')
print(xmlString)
Just produce:
Failed to process the request, Exception:'bytes' object has no attribute 'iter'
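My understanding is that ElementTree.tostring expects an Element rather than raw bytes, which would explain that message. Here is a minimal sketch of what I think the call should look like, assuming the bytes are already plain (uncompressed) XML. The xml_bytes value is a shortened, made-up stand-in for the real feed.

```python
from xml.etree import ElementTree

# Shortened stand-in for the uncompressed response body (an assumption,
# not the actual feed content).
xml_bytes = (
    b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    b'<feed xmlns="http://www.w3.org/2005/Atom">'
    b"<author><name>Webmaster</name></author>"
    b"</feed>"
)

# Parse the bytes into an Element first; tostring() serializes Elements.
root = ElementTree.fromstring(xml_bytes)

# encoding="unicode" makes tostring() return a str instead of bytes.
xml_string = ElementTree.tostring(root, encoding="unicode")

print(type(xml_string))
print(xml_string)
```

Note that the serialized string may carry a namespace prefix on each tag, since the feed declares a default namespace.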
Any help is greatly appreciated.