python replit XML conversion not working for larger files - invalid token

Question

I am using Python on Replit to convert XML files exported from EndNote into a different XML format that can be opened in Microsoft Word's reference manager. With files of up to 6 records it worked perfectly. With larger files of 30-50 records, it raised the following error:

  File "main.py", line 12, in convert_endnote_to_word
    root = ET.fromstring(endnote_xml_str)
  File "/nix/store/2vm88xw7513h9pyjyafw32cps51b0ia1-python3-3.8.12/lib/python3.8/xml/etree/ElementTree.py", line 1320, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2248
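
As a side note on the traceback above: one way to see what the parser is rejecting is to print the text around the reported position. The following is only a minimal sketch; `parse_error_context` and its `radius` parameter are hypothetical names, and it relies on the `position` attribute of `ParseError`. The reported column may not line up exactly with the string index when the text contains non-ASCII characters, which is why a window of characters is shown rather than a single one.

```python
import xml.etree.ElementTree as ET

def parse_error_context(xml_str, radius=40):
    """Return the text around the spot where parsing fails, or None if it parses.

    Hypothetical helper for illustration; not part of the original code.
    """
    try:
        ET.fromstring(xml_str)
    except ET.ParseError as err:
        line_no, column = err.position  # (line, column) as reported by the parser
        lines = xml_str.splitlines()
        # Fall back to the whole string if the reported line is out of range.
        line = lines[line_no - 1] if 0 < line_no <= len(lines) else xml_str
        start = max(0, column - radius)
        return line[start:column + radius]
    return None
```

Calling `print(repr(parse_error_context(endnote_xml_str)))` would show the surrounding characters, with `repr` making any invisible control characters visible. A bare `&` or a stray control character are common causes of an "invalid token" error.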

My code is below:

import xml.etree.ElementTree as ET
def convert_endnote_to_word(endnote_xml_str):
    word_xml_str = '<?xml version="1.0" encoding="utf-8"?>\n'
    word_xml_str += '<b:Sources xmlns:b="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" '
    word_xml_str += 'xmlns="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" SelectedStyle="">\n' 
    root = ET.fromstring(endnote_xml_str)
    for record in root.findall('./records/record'):
        source = {}
        word_xml_str += '  <b:Source>\n'
        year = record.find('./pub-dates/date').text
        source['year'] = year
        word_xml_str += f'    <b:Year>{source["year"]}</b:Year>\n'  
        source['ref_type'] = record.find('./ref-type').get('name').replace(" ", "")
        word_xml_str += f'    <b:SourceType>{source["ref_type"]}</b:SourceType>\n'
        title = record.find('./titles/title').text
        source['title'] = title
        word_xml_str += f'    <b:Title>{source["title"]}</b:Title>\n'
        authors = []
        for author in record.findall('./contributors/authors/author'):
            authors.append(author.text)
        source['authors'] = authors
        convert_authors_and_year(source)
        word_xml_str += f'    <b:Tag>{source["tag"]}</b:Tag>\n'
        word_xml_str += f'    <b:Author>\n{source["authors_xml"]}    </b:Author>\n'  
        source['url'] = record.find('./urls/related-urls/url').text
        word_xml_str += f'    <b:URL>{source["url"]}</b:URL>\n'
        word_xml_str += '  </b:Source>\n'
    word_xml_str += '</b:Sources>'
    return word_xml_str

def convert_authors_and_year(source): 
    authors = source['authors']
    year = source['year']
    last_names = [author.split()[-1] for author in authors]
    tag = f'{", ".join(last_names)} ({year})'
    source['tag'] = tag
    source['authors_xml'] = ''
    source['authors_xml'] += f'      <b:Author>\n        <b:NameList>\n'
    for author in authors:
        source['authors_xml'] += f'          <b:Person>\n'
        source['authors_xml'] += f'            <b:Last>{author.split(" ")[-1]}</b:Last>\n'
        source['authors_xml'] += f'            <b:First>{author.split(" ")[0]}</b:First>\n'
        source['authors_xml'] += '          </b:Person>\n'
    source['authors_xml'] += '        </b:NameList>\n      </b:Author>\n'

def save_word_file(endnote_xml_str, word_filepath: str = 'word_xml_file.xml'):
    word_xml_str = convert_endnote_to_word(endnote_xml_str)
    with open(word_filepath, 'w') as f:
        f.write(word_xml_str)

def convert_file(endnote_filepath: str = 'endnote_xml_file.xml', word_filepath: str = 'word_xml_file.xml'):
    with open(endnote_filepath, 'r') as f:
        endnote_xml_str = f.read()
    save_word_file(endnote_xml_str, word_filepath)

I expected it to work the same way regardless of file size. I also tried changing the parsing as follows, so that it reads the XML file directly and handles one record at a time instead of the whole document at once:

    root = ET.iterparse(endnote_xml_str)  # changed
    for event, record in root:  # changed
      if event == 'end' and record.tag == 'record':  # changed
        source = {}
        ...
        word_xml_str += '  </b:Source>\n'
      record.clear()  # changed
    word_xml_str += '</b:Sources>'

However, these changes forced further changes that may be beyond my skill: no matter how much I change, there are still errors. I am aware that I will have to replace "endnote_xml_str" with the filename, but I would prefer to keep things simple and continue working with the original code, which handles the whole XML at once, if possible.
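
For reference, the record-at-a-time idea can be sketched as below. This is a hedged illustration, not a fix: `iter_record_titles` is a hypothetical name, and it assumes `source` is a file path or a binary file object, since `ET.iterparse` reads from a file rather than from a string. Each record is cleared only after it has been processed. Incremental parsing would not by itself make a malformed file parse; the same `ParseError` would be raised once the parser reaches the bad token.

```python
import io
import xml.etree.ElementTree as ET

def iter_record_titles(source):
    """Yield the title text of each <record>, one record at a time.

    `source` is assumed to be a file path or a binary file object.
    """
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "record":
            title = elem.find("./titles/title")
            yield title.text if title is not None else None
            # Free the record's children only after it has been handled.
            elem.clear()
```

To keep working from a string that is already in memory, it could be wrapped as `io.BytesIO(endnote_xml_str.encode("utf-8"))` before being passed in, assuming the text is meant to be UTF-8.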

If endnote_xml_str is not well-formed, it is broken. It cannot be parsed as XML. — mzjn, Jan 07 '23 at 11:31
[What's so bad about building XML with string concatenation?](https://stackoverflow.com/q/3034611/1422451) — Parfait, Jan 08 '23 at 05:06
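
On the string-concatenation point raised in the comment above, a minimal sketch of building the output with `ElementTree` instead, so that characters such as `&` and `<` in titles or URLs are escaped automatically. The function name `build_sources` and the dictionary keys are assumptions made for illustration. The sketch covers only a few fields and leaves out the `SelectedStyle` attribute and the author list from the original code, and whether Word accepts the result has not been checked.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the original code; "b" is the prefix it uses.
B_NS = "http://schemas.openxmlformats.org/officeDocument/2006/bibliography"
ET.register_namespace("b", B_NS)

def build_sources(sources):
    """Build a <b:Sources> document from a list of dicts (hypothetical shape)."""
    root = ET.Element(f"{{{B_NS}}}Sources")
    for src in sources:
        node = ET.SubElement(root, f"{{{B_NS}}}Source")
        for field in ("Year", "SourceType", "Title", "URL"):
            # Assigning to .text lets the serializer escape special characters.
            ET.SubElement(node, f"{{{B_NS}}}{field}").text = src[field]
    return ET.tostring(root, encoding="unicode")
```

The returned string can be written to a file in the same way as `word_xml_str` in the original code.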
