Format error while opening XML file with Python - NULL (\x00) between characters

Question

I'm working with XML files in Python. My goal is to extract information from specific fields in these files for later use. The XML files appear to be well-formed, as I can open them with a simple text editor and see their structure. The information they contain refers to specific points in a Whole-Slide Image (WSI). Here is an example (I received these files as single-line XML, sorry...).

<?xml version="1.0"?><Annotations file="E:\CONTAJES\CONTAJE 10-03-21\21A3 A25 HE.tif"><SlideAnnotation Text="" Voice=""></SlideAnnotation><GridMap><Mag Factor="1"></Mag><Mag Factor="2"></Mag></GridMap><Counters><Counter name="SANO" r="0" g="255" b="0" diameter="20"><Points></Points></Counter><Counter name="ESCLEROSADO" r="255" g="0" b="0" diameter="20"><Points><Point X="21878" Y="5128"/><Point X="32283" Y="5168"/><Point X="32478" Y="4913"/><Point X="37093" Y="9228"/><Point X="37393" Y="7068"/><Point X="43778" Y="7683"/><Point X="32388" Y="17453"/><Point X="27172" Y="36266"/><Point X="28758" Y="37858"/></Points></Counter><Counter name="SEMILUNAS" r="0" g="0" b="255" diameter="20"><Points></Points></Counter><Counter name="HIPERCELULAR MES" r="255" g="128" b="0" diameter="20"><Points></Points></Counter><Counter name="ENDOCAPILAR" r="255" g="255" b="128" diameter="20"><Points></Points></Counter><Counter name="ISQUEMICO" r="0" g="128" b="0" diameter="20"><Points></Points></Counter><Counter name="GSSF" r="128" g="0" b="255" diameter="20"><Points></Points></Counter><Counter name="INCOMPLETO" r="0" g="0" b="0" diameter="20"><Points></Points></Counter><Counter name="GNMP" r="0" g="128" b="192" diameter="20"><Points></Points></Counter><Counter name="MIXTO" r="128" g="128" b="128" diameter="20"><Points></Points></Counter><Counter name="MEMBRANOSO" r="128" g="64" b="64" diameter="20"><Points></Points></Counter></Counters></Annotations>

I first noticed the problem when I tried to open this file in the PyCharm IDE (v2020.2.3): the content it shows is the same, but with a NULL character between every character:

<NUL?NULxNULmNULlNUL NULvNULeNULrNULsNULiNULoNULn [...]

I've tried to open the file in Python as follows (using bs4 and lxml packages):

from bs4 import BeautifulSoup
file = 'sample.xml'
with open(file, 'r') as f:
    data = f.read()
Bs_data = BeautifulSoup(data, 'xml')

If I print data, I get the following output:

'<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00= [...]'

Where \x00 stands for the aforementioned NULL value. Bs_data contains only the first field in the file:

<?xml version="1.0" encoding="utf-8"?>

The encoding attribute doesn't exist in the original file.

Could this problem be related to the encoding? How can I correctly read the file in Python to collect its data?

For now, as I need a solution ASAP, what I'm doing is simply modifying the data string as follows:

[...]
    data = f.read()
    data = data.replace("\x00", '')
[...]

But I really want to understand the root of the problem.

Thanks in advance for your contributions!

The description sounds vaguely like the file is actually in UTF-16. Try `with open(file, 'r', encoding='utf-16le') as f:` (I see no reason as such to put the file name in a variable, but that is the minimal fix). Perhaps a better fix would be to normalize the files to UTF-8 separately, immediately after saving them. — tripleee, Nov 03 '21 at 08:53
Thanks! This solved my problem. As you can see, I'm not used to working with XML files, so I didn't notice the "clues" that the format was UTF-16. — FranmR, Nov 03 '21 at 10:34
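
To illustrate the comments above, here is a minimal sketch of why the NULL bytes appear and how the file can be read once the encoding is named explicitly. It is not the asker's final code. The helper names read_utf16_text and points_by_counter are hypothetical, the sample string is a shortened stand-in for the file in the question, and the fallback to little-endian when no byte-order mark is present is an assumption based on the output shown in the question. The standard-library xml.etree.ElementTree is used here in place of bs4 to keep the example self-contained.

```python
import codecs
import os
import tempfile
import xml.etree.ElementTree as ET


def read_utf16_text(path):
    """Return the contents of a UTF-16 encoded file as a str (hypothetical helper)."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic "utf-16" codec consumes the BOM and takes the byte order from it.
        return raw.decode("utf-16")
    # Assumption: no BOM means little-endian, matching the '<\x00?\x00' pattern above.
    return raw.decode("utf-16-le")


def points_by_counter(xml_text):
    """Map each Counter name to a list of (X, Y) integer tuples (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        counter.get("name"): [
            (int(p.get("X")), int(p.get("Y"))) for p in counter.iter("Point")
        ]
        for counter in root.iter("Counter")
    }


# Shortened stand-in for the annotation file in the question.
sample = (
    '<?xml version="1.0"?><Annotations><Counters>'
    '<Counter name="SANO"><Points></Points></Counter>'
    '<Counter name="ESCLEROSADO"><Points>'
    '<Point X="21878" Y="5128"/><Point X="32283" Y="5168"/>'
    '</Points></Counter></Counters></Annotations>'
)

# UTF-16 stores each ASCII character in two bytes, one of which is zero.
# Reading those bytes with a single-byte encoding is what shows up as \x00.
raw_bytes = sample.encode("utf-16-le")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "wb") as f:
        f.write(raw_bytes)
    text = read_utf16_text(path)

points = points_by_counter(text)
print(points)
```

Decoding the bytes with the right codec addresses the root cause, whereas data.replace("\x00", '') only happens to work while every character in the file is plain ASCII.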
