I am using Python on Replit to convert XML files exported from EndNote into a different XML format that can be opened in Microsoft Word's reference manager. When I tested it on files with up to 6 records, it worked perfectly. When I tried larger files with 30-50 records, it raised the following error:
  File "main.py", line 12, in convert_endnote_to_word
    root = ET.fromstring(endnote_xml_str)
  File "/nix/store/2vm88xw7513h9pyjyafw32cps51b0ia1-python3-3.8.12/lib/python3.8/xml/etree/ElementTree.py", line 1320, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2248
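One way to see which character the parser is rejecting is to print the text around the reported position. The following is only a minimal sketch: the helper name `show_parse_error_context` and its `width` parameter are made up for illustration, and the reported column is treated as an approximate 0-based offset into the line, so the window is kept wide on both sides.

```python
import xml.etree.ElementTree as ET


def show_parse_error_context(xml_str, width=40):
    """Return the text around a ParseError position, or None if parsing succeeds.

    Hypothetical helper for illustration; not part of the original script.
    """
    try:
        ET.fromstring(xml_str)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple; the line number is 1-based.
        line_no, column = err.position
        lines = xml_str.split('\n')
        line = lines[line_no - 1] if 0 < line_no <= len(lines) else ''
        start = max(column - width, 0)
        # repr() makes control characters and other invisible bytes visible.
        return repr(line[start:column + width])
    return None
```

Calling `show_parse_error_context(endnote_xml_str)` on the failing export should show the characters near column 2248. A bare `&` or a control character in a title or URL would be a typical cause of an "invalid token" error, though that is an assumption until the output is inspected.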
My code is below:
import xml.etree.ElementTree as ET


def convert_endnote_to_word(endnote_xml_str):
    # Open the Word bibliography document.
    word_xml_str = '<?xml version="1.0" encoding="utf-8"?>\n'
    word_xml_str += '<b:Sources xmlns:b="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" '
    word_xml_str += 'xmlns="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" SelectedStyle="">\n'
    # Parse the whole EndNote export at once; this is the call that raises ParseError.
    root = ET.fromstring(endnote_xml_str)
    for record in root.findall('./records/record'):
        source = {}
        word_xml_str += ' <b:Source>\n'
        year = record.find('./pub-dates/date').text
        source['year'] = year
        word_xml_str += f' <b:Year>{source["year"]}</b:Year>\n'
        source['ref_type'] = record.find('./ref-type').get('name').replace(" ", "")
        word_xml_str += f' <b:SourceType>{source["ref_type"]}</b:SourceType>\n'
        title = record.find('./titles/title').text
        source['title'] = title
        word_xml_str += f' <b:Title>{source["title"]}</b:Title>\n'
        authors = []
        for author in record.findall('./contributors/authors/author'):
            authors.append(author.text)
        source['authors'] = authors
        # Fills in source['tag'] and source['authors_xml'].
        convert_authors_and_year(source)
        word_xml_str += f' <b:Tag>{source["tag"]}</b:Tag>\n'
        word_xml_str += f' <b:Author>\n{source["authors_xml"]} </b:Author>\n'
        source['url'] = record.find('./urls/related-urls/url').text
        word_xml_str += f' <b:URL>{source["url"]}</b:URL>\n'
        word_xml_str += ' </b:Source>\n'
    word_xml_str += '</b:Sources>'
    return word_xml_str

def convert_authors_and_year(source):
    authors = source['authors']
    year = source['year']
    # Build a tag such as "Smith, Jones (2020)" from the authors' last names.
    last_names = [author.split()[-1] for author in authors]
    tag = f'{", ".join(last_names)} ({year})'
    source['tag'] = tag
    source['authors_xml'] = ''
    source['authors_xml'] += ' <b:Author>\n <b:NameList>\n'
    for author in authors:
        source['authors_xml'] += ' <b:Person>\n'
        source['authors_xml'] += f' <b:Last>{author.split(" ")[-1]}</b:Last>\n'
        source['authors_xml'] += f' <b:First>{author.split(" ")[0]}</b:First>\n'
        source['authors_xml'] += ' </b:Person>\n'
    source['authors_xml'] += ' </b:NameList>\n </b:Author>\n'

def save_word_file(endnote_xml_str, word_filepath: str = 'word_xml_file.xml'):
    word_xml_str = convert_endnote_to_word(endnote_xml_str)
    # Write as UTF-8 to match the encoding declared in the XML header.
    with open(word_filepath, 'w', encoding='utf-8') as f:
        f.write(word_xml_str)


def convert_file(endnote_filepath: str = 'endnote_xml_file.xml', word_filepath: str = 'word_xml_file.xml'):
    with open(endnote_filepath, 'r', encoding='utf-8') as f:
        endnote_xml_str = f.read()
    save_word_file(endnote_xml_str, word_filepath)
I expected it to work the same way regardless of the size of the file. I also tried changing the parsing as follows, so that it reads the XML file directly and handles one record at a time instead of loading the whole document at once (the changed lines are marked):
    root = ET.iterparse(endnote_xml_str)  # changed
    for event, record in root:  # changed
        if event == 'end' and record.tag == 'record':  # changed
            source = {}
            ...
            word_xml_str += ' </b:Source>\n'
            record.clear()  # changed
    word_xml_str += '</b:Sources>'
However, these changes forced further changes that may be beyond my skill level: no matter how much I change, there are still errors. I am aware that I will have to replace "endnote_xml_str" with the filename, but I would prefer to keep things simple and continue working with the original code, which handles the whole XML at once, if possible.
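For reference, the record-at-a-time approach could look roughly like the sketch below. `ET.iterparse` expects a filename or a file object rather than the XML text itself. The function name `iter_record_titles` and the two-record sample are hypothetical and exist only to make the example self-contained. Note that `iterparse` uses the same underlying parser, so a file that is not well-formed would be expected to raise the same `ParseError` during iteration.

```python
import io
import xml.etree.ElementTree as ET


def iter_record_titles(source):
    """Yield the title text of each <record>, one record at a time.

    `source` is a filename or a binary file object (hypothetical helper).
    """
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'record':
            title = elem.find('./titles/title')
            yield title.text if title is not None else None
            # Free the memory held by the record that was just processed.
            elem.clear()


# Hypothetical two-record sample standing in for an EndNote export.
sample = io.BytesIO(
    b'<xml><records>'
    b'<record><titles><title>A</title></titles></record>'
    b'<record><titles><title>B</title></titles></record>'
    b'</records></xml>'
)
titles = list(iter_record_titles(sample))
```

With a real export, `iter_record_titles('endnote_xml_file.xml')` would take the file path instead of the in-memory sample.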