I am having trouble parsing log files that contain the & character, but only when it is not followed by amp; (that is, when it is not already escaped as &amp;). Can the input be fixed before parsing, or do I have to look for the fault elsewhere?

I am getting the error xml.etree.ElementTree.ParseError: not well-formed (invalid token), and I have isolated the & as the only unusual character on the failing line. When the & is followed by amp;, there is no problem.

Code:

import xml.etree.ElementTree as ET
import os
import errno

path = "C:\\Users\\SuperUser\\Desktop\\audit\\audit\\saved\\audit"

for filename in os.listdir(path):
    with open(path + "\\" + filename) as myfile:
        lines = myfile.readlines()

    xmlfile = open("logins.xml", "w")

    for line in lines:
        # print(ET.fromstring(line))
        xmlVal = ET.fromstring(line)
        finder = "UserAuthenticated/Action"
        if xmlVal.find(finder) is not None and xmlVal.find(finder).text == 'Login':
            username = xmlVal.find("UserAuthenticated/LocalUsername").text
            timestamp = xmlVal.find("TimeStamp").text
            xmlToWrite = '<?xml version="1.0" encoding="UTF-8"?><root><Username>' + username + '</Username><Timestamp>' + timestamp + '</Timestamp></root>\n'
            xmlfile.write(xmlToWrite)
            print("Writing '" + xmlToWrite + "' to logins.xml")

    xmlfile.close()
Lycan

1 Answer


The post "Creating a simple XML file using python" has examples of how to write an XML file using Python's ElementTree.

It's always best to use a library for creating XML rather than trying to write it as plain text. Escaping special characters is one reason; another is to ensure you get the start and end tags and namespaces right. We see a lot of people struggling to parse broken XML on StackOverflow, and it's usually because someone wrongly thought it would be easy to hand-generate it rather than using a library for the job.
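
As a rough illustration of the point about escaping, here is a minimal sketch of how the output record from the question could be built with ElementTree instead of string concatenation. The helper name `build_login_record` and the sample values are made up for this example; the element names are taken from the question's code.

```python
import xml.etree.ElementTree as ET


def build_login_record(username, timestamp):
    """Hypothetical helper: return one serialized <root> record as a string."""
    root = ET.Element("root")
    # Assigning to .text stores the raw value; escaping happens on serialization.
    ET.SubElement(root, "Username").text = username
    ET.SubElement(root, "Timestamp").text = timestamp
    # encoding="unicode" makes tostring() return str instead of bytes.
    return ET.tostring(root, encoding="unicode")


# Sample values (assumed, not from the original logs).
record = build_login_record("R&D admin", "2018-05-16T07:26:00")
print(record)
```

The `&` in the username is written out as `&amp;`, so the record can be parsed back without error. Note that this only covers the writing side: an input line that already contains a bare `&` is not well-formed XML, and no serializer can help until that line has been repaired or regenerated.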

Michael Kay
  • Thank you for your answer. It does not however solve my problem at the moment. The issue I'm having is that `line` here `xmlVal = ET.fromstring(line)` sometimes contains the `&` that it can't handle. Unless I can somehow read `line` in and escape it before I hand it to `ElementTree` my problem still stands, I think. – Lycan May 16 '18 at 07:26
  • No, the serialization library in ElementTree will take care of any escaping that's needed. – Michael Kay May 16 '18 at 07:35
  • Somehow it doesn't... It still complains. Everything else works, but it doesn't seem to escape it. – Lycan May 16 '18 at 08:02
  • Sorry for reviving an old thread, but I had the same issue. What helped me was to unescape the original string and then save it normally with `tree.write(xmlFile, encoding="UTF-8",xml_declaration=None, method="xml")`. I had a field `value = "&quot;22.2&quot;"` and needed to replace the 22.2 inside; `value = '"'+variable+'"'` worked well. – darth0s Jul 16 '20 at 11:53
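
For the input side discussed in the comments, one possible workaround (a sketch, not something proposed in the thread) is to escape bare ampersands in each line before handing it to `ET.fromstring`. The pattern `BARE_AMP`, the function `escape_bare_ampersands`, and the sample line are all made up for this example. It assumes the logs contain no CDATA sections and no named entities beyond the predefined XML ones; an entity such as `&nbsp;` would be left alone and would still fail to parse.

```python
import re
import xml.etree.ElementTree as ET

# Match "&" unless it already starts something that looks like an entity
# (&name;) or a character reference (&#38; or &#x26;).
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(line):
    """Hypothetical helper: turn each unescaped & into &amp;."""
    return BARE_AMP.sub("&amp;", line)


# Sample line (assumed): one bare &, one &amp;, one numeric reference.
line = "<Event><Name>Tom & Jerry &amp; Co &#38; more</Name></Event>"

fixed = escape_bare_ampersands(line)
elem = ET.fromstring(fixed)
print(elem.find("Name").text)
```

Parsing `line` directly raises `ParseError`, while `fixed` parses and the text comes back with plain `&` characters. This is a repair for input that is already broken; fixing whatever produces the logs so that it escapes `&` properly is the more robust option.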