I started learning Python a few days ago in order to build a basic site that compiles statistics from BOINC projects, e.g. SETI@home.

Basically, the site does the following:

  • Download gz files
  • Uncompress gz files into xml files
  • Build xml info into data structures
  • Write data structures back into csv files
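
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the site's actual code: the helper names `users_from_gz` and `write_csv` are hypothetical, and it assumes each record is a `user` element whose children (`id`, `name`, `total_credit`, ...) hold the values. The download step is left out so the sketch works on bytes already in memory.

```python
import csv
import gzip
import io
import xml.etree.ElementTree as ET


def users_from_gz(gz_bytes):
    """Yield one dict per <user> element found in gzip-compressed XML."""
    # gzip.open accepts a file object, so nothing is written to disk here.
    with gzip.open(io.BytesIO(gz_bytes), "rb") as xml_stream:
        for event, elem in ET.iterparse(xml_stream, events=("end",)):
            if elem.tag == "user":  # assumed record element name
                yield {child.tag: child.text for child in elem}
                elem.clear()  # drop the children once the record is copied


def write_csv(rows, fieldnames, out):
    """Write the dicts in rows to the text stream out as CSV."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
```

To write a real file, `out` could be a file opened with `open(path, "w", newline="")`, which is the mode the `csv` module documentation recommends.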

In total there are 34 .gz files from 34 different BOINC projects.

All the code is now finished and works; however, the .gz file from one project refuses to parse, whereas the other 33 work fine.

The file is:

user.gz

from

http://www.rnaworld.de/rnaworld/stats/

These are the errors that I am getting:

Traceback (most recent call last):
  File "C:/Users/chris/PycharmProjects/testproject1/rnaw100.py", line 77, in <module>
    for event, elem in ET.iterparse(str(x_file_name2), events=("start", "end")):
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1227, in iterator
    yield from pullparser.read_events()
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1302, in read_events
    raise event
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1274, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

As a newbie I am finding it difficult to understand what is wrong, as (a) the errors refer to a Python core file, e.g. ElementTree.py, (b) I can't understand why a .gz file that many other BOINC stats sites use won't work here, and (c) I don't see why my code works on 33 files but not on this one.

This is the code that downloads the .gz file and parses the XML (I have left out variable declarations etc.):

import gzip
import shutil
import xml.etree.ElementTree as ET

import requests

# Download the .gz file to disk.
response = requests.get(url2, stream=True)

if response.status_code == 200:
    with open(target_path2, 'wb') as f:
        f.write(response.raw.read())

# Decompress the .gz file into an XML file.
with gzip.open(target_path2, 'rb') as f_in:
    with open(x_file_name2, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Stream through the XML, remembering the fields of the current user.
for event, elem in ET.iterparse(str(x_file_name2), events=("start", "end")):

    if elem.tag == "total_credit" and event == "end":
        tc = float(elem.text)
        elem.clear()

    if elem.tag == "expavg_credit" and event == "end":
        ac = float(elem.text)
        elem.clear()

    if elem.tag == "id" and event == "end":
        id = elem.text
        elem.clear()

    if elem.tag == "cpid" and event == "end":
        cpid = elem.text
        elem.clear()

    if elem.tag == "name" and event == "end":
        name = elem.text
        elem.clear()
    teamid = TEAMID

    # Store the user only if they belong to the team of interest.
    if elem.tag == "teamid" and event == "end":
        if elem.text == TEAMID:
            cnt = cnt + 1
            dic[id] = {"Name": name, "CPID": cpid, "TC": tc, "AC": ac}
        elem.clear()
Chris
  • Open the unzipped file to see if the data is in XML format. – dabingsou Mar 11 '20 at 00:03
  • If I unpack the .gz file with 7-Zip, then open the unpacked XML file in PyCharm and re-save it, my code reads it with no problem. – Chris Mar 11 '20 at 01:49
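
The error points at line 1, column 0, which suggests that the very first bytes of the decompressed file are not what the XML parser expects. One way to follow the suggestion above and check what the file actually contains is to look at its leading bytes. This is a minimal sketch: `describe_leading_bytes` is a hypothetical helper, not part of any library, and the labels it returns are arbitrary.

```python
import codecs


def describe_leading_bytes(path, n=16):
    """Return a short label for what the first bytes of a file look like."""
    with open(path, "rb") as f:
        head = f.read(n)
    if head.startswith(b"\x1f\x8b"):
        return "gzip"  # gzip magic number: the data is still compressed
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16-bom"
    if head.lstrip().startswith(b"<"):
        return "xml"  # plausible start of an XML document
    return "unknown"
```

Running it on the decompressed file (`x_file_name2` in the question) and on one of the files that parses correctly would show whether the two differ at byte level. Which label the problem file would produce has not been established in this thread.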

1 Answer

Another solution, using the simplified_scrapy library:

from simplified_scrapy import SimplifiedDoc,req,utils
import gzip
with gzip.open('user.gz', 'rb') as f_in:
  with open('user.xml', 'wb') as f_out:
    f_out.write(f_in.read())
html = utils.getFileContent('user.xml')
doc = SimplifiedDoc(html)
users = doc.selects('user')
for user in users:
  tags = user.children

@Chris I decompressed the file and saved it, and the data is correct. Try replacing your shutil code with this:

import gzip
with gzip.open('user.gz', 'rb') as f_in:
    with open('user.xml', 'wb') as f_out:
        f_out.write(f_in.read())
dabingsou
  • I replaced my code with yours, but I get exactly the same error. – Chris Mar 13 '20 at 08:14
  • @Chris It works here, or you can try another solution. – dabingsou Mar 13 '20 at 10:48
  • I already tried changing shutil; it did not make any difference. The code I have works fine for the other 33 xml gz files. In the end I solved the problem by using RPC http to generate the xml files. This works fine with the same code, i.e. ElementTree iterparse. But thank you for your code suggestions. – Chris Mar 14 '20 at 11:31
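
Since swapping the decompression code made no difference, the cause may lie in the download step instead. One possible explanation, which is an assumption and not confirmed for this server, is that the file was compressed a second time in transit: in requests, `response.raw` yields the bytes as they came over the wire without undoing any `Content-Encoding`, whereas `response.content` decodes it. A hedged sketch that keeps decompressing while the data still starts with the gzip magic number; `gunzip_fully` and `max_layers` are hypothetical names:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def gunzip_fully(data, max_layers=3):
    """Decompress gzip data repeatedly while the result still looks like gzip.

    Returns the final bytes and the number of layers that were removed.
    """
    layers = 0
    while data[:2] == GZIP_MAGIC and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return data, layers
```

If the second value returned for the downloaded file were 2, that would support the double-compression guess; a value of 1 would point elsewhere, for example at the file's encoding.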