I have thousands of XML files like the following:

<names>
    <Id>1518845</Id>
    <Name>Confessions of a Thug (Paperback)</Name>
    <Authors>Philip Meadows Taylor</Authors>
    <Publisher>Rupa & Co</Publisher>
    <CountsOfReview>2.0</CountsOfReview>
</names>

I've tried the following code to parse them:

from lxml import etree

root = etree.parse("xm_file.xml")

and

import xml.etree.ElementTree as ET

tree = ET.parse("xm_file.xml")

and

parser = ET.XMLParser(encoding="utf-8")
tree = ET.parse("xm_file.xml", parser=parser)

and all of them lead to one of these errors:

ParseError: not well-formed (invalid token): line 10, column 18
XMLSyntaxError: xmlParseEntityRef: no name, line 10, column 19

I have searched and tried many approaches to make this work for all files, but in vain.

Note: this did not help me: How to parse invalid (bad / not well-formed) XML?

Another case is:

<names>
    <Id>1481744</Id>
    <Name>Lettres de René-Édouard Claparède <1832-1871>.: Choisies et annotées</Name>
    <Authors>René-Édouard Claparède</Authors>
    <ISBN>3796505635</ISBN>
    <Rating>2.0</Rating>
    <PublishYear>1971</PublishYear>
    <PublishMonth>31</PublishMonth>
    <PublishDay>12</PublishDay>
</names>

While parsing, the XML is handled as if it were:

<names>
    <Id>1481744</Id>
    <Name>Lettres de René-Édouard Claparède</Name>
</names>

and the other fields do not appear.

  • 1
    Maybe this helps? https://stackoverflow.com/questions/7604436/xmlparseentityref-no-name-warnings-while-loading-xml-into-a-php-file – Jan Apr 30 '21 at 19:53
  • 1
    This is python not PHP – Ashraf Khaled Apr 30 '21 at 19:53
  • 2
    But the solution is the same. – Jan Apr 30 '21 at 19:58
  • 1
    It’s not XML, Jim, at least not as we know it. Your question isn’t titled correctly - what you’re trying to parse *isn’t XML* – DisappointedByUnaccountableMod Apr 30 '21 at 20:48
  • It's XML with an invalid format – Ashraf Khaled Apr 30 '21 at 20:51
  • 1
    No, it's ***not*** XML. @barny is right. You did not understand the duplicate link the last time you asked this exact question. You cannot expect an XML parser, which is written based on following the rules that *define* XML, to succeed with arbitrary transgressions against those rules. – kjhughes Apr 30 '21 at 21:24
  • 1
    The `&` and `<` characters cannot appear in content without being escaped because those unescaped characters have special meaning in XML. If you get textual data that you wish to repair to be XML, it's a hard problem to solve automatically. Re-read the [duplicate link](https://stackoverflow.com/q/44765194/290085) for more details and for guidance on how to proceed. ***Do not just keep repeating your post.*** Your case is covered there; it is not special. – kjhughes Apr 30 '21 at 21:27
  • 3
    You don't have thousands of XML files. You have thousands of non-XML files. In fact, you have a heap of junk. – Michael Kay Apr 30 '21 at 21:28
  • OK, thanks a lot for this – Ashraf Khaled Apr 30 '21 at 21:51

1 Answer

You could replace each & with &amp; before parsing:

import xml.etree.ElementTree as ET

data = """

<names>
    <Id>1518845</Id>
    <Name>Confessions of a Thug (Paperback)</Name>
    <Authors>Philip Meadows Taylor</Authors>
    <Publisher>Rupa & Co</Publisher>
    <CountsOfReview>2.0</CountsOfReview>
</names>

"""

data = data.replace('&', '&amp;')
tree = ET.ElementTree(ET.fromstring(data))

for publisher in tree.findall("Publisher"):
    print(publisher.text)

Which yields

Rupa & Co
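
One caveat: a plain replace also rewrites an & that already starts an entity, so an existing &amp; would become &amp;amp;. A minimal sketch that only escapes bare ampersands, assuming the files may mix escaped and unescaped text (the names BARE_AMP and escape_bare_ampersands are made up for this example):

```python
import re
import xml.etree.ElementTree as ET

# Matches "&" unless it already starts something shaped like an entity,
# e.g. &amp; (named), &#38; (decimal) or &#x26; (hexadecimal).
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Escape only the ampersands that are not part of an entity reference."""
    return BARE_AMP.sub("&amp;", text)


sample = (
    "<names>"
    "<Publisher>Rupa & Co</Publisher>"
    "<Name>Tom &amp; Jerry</Name>"
    "</names>"
)

root = ET.fromstring(escape_bare_ampersands(sample))
print(root.findtext("Publisher"))  # Rupa & Co
print(root.findtext("Name"))       # Tom & Jerry
```

Anything that merely looks like an entity (for example &nbsp;) is left untouched by this pattern, and ElementTree will still reject it unless it is one of the five predefined XML entities.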

One way to apply this to your files is to read each file, replace the & and pass the result to xml.etree.ElementTree, as in:

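import xml.etree.ElementTree as ET

# Read the whole file, escape the ampersands, then parse the repaired string.
with open("some_cool_file") as fp:
    content = fp.read()

content = content.replace('&', '&amp;')
xml = ET.ElementTree(ET.fromstring(content))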
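
The second sample in the question (the title containing <1832-1871>) is not repaired by replacing & alone, because the stray < and > also break the markup. A possible sketch for that case, under two assumptions that may not hold for every file: each field sits on its own line as <Tag>text</Tag>, and the text contains no entities that are already escaped. The names LINE and repair_flat_record are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumption: one element per line, opening and closing tag on the same line.
# The backreference \2 ties the closing tag to the opening tag name.
LINE = re.compile(r"^(\s*)<(\w+)>(.*)</\2>\s*$")


def repair_flat_record(text):
    """Escape &, < and > inside the text of each single-line element."""
    fixed = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            indent, tag, body = match.groups()
            line = f"{indent}<{tag}>{escape(body)}</{tag}>"
        fixed.append(line)  # lines such as <names> and </names> pass through
    return "\n".join(fixed)


sample = """<names>
    <Id>1481744</Id>
    <Name>Lettres de René-Édouard Claparède <1832-1871>.: Choisies et annotées</Name>
    <Publisher>Rupa & Co</Publisher>
</names>"""

root = ET.fromstring(repair_flat_record(sample))
print(root.findtext("Name"))
print(root.findtext("Publisher"))
```

This only works because the records are flat. With nested elements or values spanning several lines, the line-based pattern would not match, and the data would be better repaired at its source.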
Jan