Accessing XML data in machine files

Question

I want to read some analysis logs of machine data. The data I want to read is written as XML, but it sits at the end of the files. The problem is that the files start with unreadable machine data, and I can't find a way to read the files and access the XML data. Since the machine data spans more than 1000 lines, I would prefer to ignore it and read in only the XML data. Also, the file names end in .wve rather than .xml.

Help would be greatly appreciated!

Here is an example of the file

This seems super-ugly, since even the XML part lacks the required version header and thus will not be considered valid by decent tools. I would probably start by writing a separate tool/routine that extracts the XML part and writes it to a separate file; I guess that the probability of stumbling upon `` in the machine data is pretty low. — guidot, Aug 21 '23 at 09:43
The readable part does not look like proper XML. There are attributes on end tags, which is not allowed. — mzjn, Aug 21 '23 at 09:58
That file is a mess. 1. Scan to what you think is the start of the "XML". 2. Try the techniques given in the duplicate link. 3. Berate the party responsible for generating that data. — kjhughes, Aug 21 '23 at 13:42
Thanks guys for the comments. Maybe my "XML" data is not as bad as I posted it (fingers crossed), but at the moment I probably have to process the data as text. We'll see... — Nathan Seiler, Aug 23 '23 at 06:33

score 0 · Accepted Answer · answered Aug 21 '23 at 09:47

Working with a couple of assumptions:

If we are certain that the <Version> tag is always present
If we are certain that the <Version> tag is always the first tag

Then we could just look for it and discard everything that comes before, something like this:

# Use your file name instead of `data.wve` here.
# errors="replace" keeps the binary machine data from raising UnicodeDecodeError.
with open("data.wve", errors="replace") as data_file:
    file_content = data_file.read()

# Split the content of the file in 2 parts, starting at the version tag
before, marker, after = file_content.partition("<Version")
if not marker:
    raise ValueError("no <Version tag found in the file")
# `partition` returns the separator on its own, so place it back at the beginning of the string
xml_data = marker + after
print(xml_data)

You could then look into the Python standard library's XML processing modules to parse the remaining XML data.
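As a minimal sketch of that next step, the extracted text can be handed to `xml.etree.ElementTree` to check whether it is well-formed at all. The helper name `try_parse` and the sample strings are made up for illustration; they are not taken from the real .wve file.

```python
import xml.etree.ElementTree as ET


def try_parse(xml_text):
    """Return the root element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple pointing at the problem
        print(f"not well-formed XML at line/column {err.position}: {err}")
        return None


# Hypothetical sample strings, not taken from the real .wve file
print(try_parse("<Version>1.0</Version>").text)  # well-formed input
print(try_parse('<Tag a="1">x</Tag a="1">'))     # attribute on the end tag
```

If the parser rejects the text, the reported position shows where it stops being XML, which helps decide whether a small text clean-up is enough or the data has to be processed by hand.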

Except that the stuff that follows `` isn't XML - it has attributes in end tags. — Michael Kay, Aug 21 '23 at 10:00
Thank you for helping me get to the part of the file I want to work on. I did not realize that the data is not XML, as I am a complete newbie in this area. But we'll figure something out, I guess. — Nathan Seiler, Aug 23 '23 at 05:48

artygo · Answer 2 · 2023-08-21T10:30:26.660

As underlined by mzjn, this document is not XML, which means you are going to have to parse it manually...

If you still want to extract the "XML-like" part, you can proceed as follows:

with open('awesome.wve', 'rb') as f:
    content = f.read()

# the last '>' will be the end of the pseudo_xml
end_of_pseudo_xml = content.rfind(b'>') + 1

# the last tag will be something like </someTag>
ending_tag = content[content.rfind(b'<') : end_of_pseudo_xml]

# the first tag will be something like <someTag> or <someTag some_value="V">
first_tag = ending_tag.replace(b'</', b'<').replace(b'>', b'')

# the pseudo_xml will be something like <someTag...> ... </someTag>
pseudo_xml = content[content.find(first_tag) : end_of_pseudo_xml]

# from bytes to string
pseudo_xml = pseudo_xml.decode()

Note the use of rfind to search the content of the string from the end.
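If the attributes on end tags turn out to be the only thing that keeps the extracted text from being XML, one possible clean-up is to strip them with a regular expression before parsing. This is only a sketch under that assumption: the function name `strip_end_tag_attributes` and the sample string are hypothetical, and the pattern assumes that no attribute value contains a `>` character.

```python
import re
import xml.etree.ElementTree as ET

# Matches an end tag that carries extra text after its name, e.g. </Tag a="1">
END_TAG_WITH_ATTRS = re.compile(r"</([A-Za-z_][\w.:\-]*)\s[^>]*>")


def strip_end_tag_attributes(pseudo_xml):
    """Rewrite </Tag a="1"> as </Tag>; assumes no '>' inside attribute values."""
    return END_TAG_WITH_ATTRS.sub(r"</\1>", pseudo_xml)


# Hypothetical sample, not taken from the real .wve file
sample = '<Log id="7"><Entry n="1">ok</Entry n="1"></Log id="7">'
cleaned = strip_end_tag_attributes(sample)
root = ET.fromstring(cleaned)
print(cleaned)
print(root.tag, root.attrib, [child.text for child in root])
```

The start tags keep their attributes, so nothing is lost as long as the end tags only repeat what the start tags already say. If the real file has other defects, `ET.fromstring` will still raise `ParseError` and manual parsing remains the fallback.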

The readable content is not XML (attributes on end tags are not allowed) — mzjn, Aug 21 '23 at 10:05
