
I'm working with XML files in Python. My goal is to extract information from specific fields in these files for later use. The files appear to be well-formed, as I can open them in a simple text editor and see their structure. The information they contain refers to specific points in a Whole-Slide Image (WSI). Here is an example (I received these files as single-line XML, sorry...).

<?xml version="1.0"?><Annotations file="E:\CONTAJES\CONTAJE 10-03-21\21A3 A25 HE.tif"><SlideAnnotation Text="" Voice=""></SlideAnnotation><GridMap><Mag Factor="1"></Mag><Mag Factor="2"></Mag></GridMap><Counters><Counter name="SANO" r="0" g="255" b="0" diameter="20"><Points></Points></Counter><Counter name="ESCLEROSADO" r="255" g="0" b="0" diameter="20"><Points><Point X="21878" Y="5128"/><Point X="32283" Y="5168"/><Point X="32478" Y="4913"/><Point X="37093" Y="9228"/><Point X="37393" Y="7068"/><Point X="43778" Y="7683"/><Point X="32388" Y="17453"/><Point X="27172" Y="36266"/><Point X="28758" Y="37858"/></Points></Counter><Counter name="SEMILUNAS" r="0" g="0" b="255" diameter="20"><Points></Points></Counter><Counter name="HIPERCELULAR MES" r="255" g="128" b="0" diameter="20"><Points></Points></Counter><Counter name="ENDOCAPILAR" r="255" g="255" b="128" diameter="20"><Points></Points></Counter><Counter name="ISQUEMICO" r="0" g="128" b="0" diameter="20"><Points></Points></Counter><Counter name="GSSF" r="128" g="0" b="255" diameter="20"><Points></Points></Counter><Counter name="INCOMPLETO" r="0" g="0" b="0" diameter="20"><Points></Points></Counter><Counter name="GNMP" r="0" g="128" b="192" diameter="20"><Points></Points></Counter><Counter name="MIXTO" r="128" g="128" b="128" diameter="20"><Points></Points></Counter><Counter name="MEMBRANOSO" r="128" g="64" b="64" diameter="20"><Points></Points></Counter></Counters></Annotations>

I first noticed the problem when I tried to open this file in the PyCharm IDE (v2020.2.3): it shows the same content, but with a NUL character after every character:

<NUL?NULxNULmNULlNUL NULvNULeNULrNULsNULiNULoNULn [...]

I've tried to open the file in Python as follows (using bs4 and lxml packages):

from bs4 import BeautifulSoup
file = 'sample.xml'
with open(file, 'r') as f:
    data = f.read()
Bs_data = BeautifulSoup(data, 'xml')

If I print data, I get the following output:

'<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00= [...]'

Where \x00 stands for the NUL character mentioned above. Bs_data contains only the first field in the file:

<?xml version="1.0" encoding="utf-8"?>

The encoding attribute doesn't exist in the original file.

Could this problem be related to the file's encoding? How can I correctly read the file in Python to collect its data?

For now, as I need a solution ASAP, what I'm doing is simply modifying the data string as follows:

[...]
    data = f.read()
    data = data.replace("\x00", '')
[...]

But I really want to understand the root of the problem.

Thanks in advance for your contributions!

FranmR
  • The description sounds vaguely like the file is actually in UTF-16. Try `with open(file, 'r', encoding='utf-16le') as f:` (I see no reason as such to put the file name in a variable, but that is the minimal fix). Perhaps a better fix would be to normalize the files to UTF-8 separately, immediately after saving them. – tripleee Nov 03 '21 at 08:53
  • Thanks! This solved my problem. As you have seen, I'm not used to working with XML files, so I didn't notice the "clues" that the format was UTF-16. – FranmR Nov 03 '21 at 10:34
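
For anyone landing here later, the fix from the comments can be sketched roughly as follows. This is only an illustration under some assumptions: the sample string is a shortened stand-in for the real annotation file, `read_points` is a hypothetical helper name, and the parsing uses the standard-library `xml.etree.ElementTree` instead of BeautifulSoup so that the snippet runs without extra packages. It also assumes the file is UTF-16LE without a byte order mark, which is what the `'<\x00?\x00x\x00...'` output suggests.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Shortened stand-in for the annotation file shown in the question.
SAMPLE = (
    '<?xml version="1.0"?><Annotations><Counters>'
    '<Counter name="SANO"><Points></Points></Counter>'
    '<Counter name="ESCLEROSADO"><Points>'
    '<Point X="21878" Y="5128"/><Point X="32283" Y="5168"/>'
    '</Points></Counter>'
    '</Counters></Annotations>'
)


def read_points(path, encoding="utf-16-le"):
    """Hypothetical helper: return {counter name: [(x, y), ...]}."""
    # Decoding with the right codec removes the interleaved NUL characters.
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    root = ET.fromstring(text)
    points = {}
    for counter in root.iter("Counter"):
        points[counter.get("name")] = [
            (int(p.get("X")), int(p.get("Y"))) for p in counter.iter("Point")
        ]
    return points


# Write the sample as UTF-16LE (no BOM) to mimic the file in the question.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(SAMPLE.encode("utf-16-le"))
    sample_path = tmp.name

# A single-byte codec reproduces the symptom: every ASCII character in
# UTF-16LE is stored as two bytes, the second of which is 0x00.
with open(sample_path, "r", encoding="latin-1") as f:
    wrong = f.read()
print(repr(wrong[:10]))

# Reading with the matching codec gives clean text that parses normally.
points = read_points(sample_path)
print(points["ESCLEROSADO"])

os.remove(sample_path)
```

Stripping `"\x00"` after the fact happens to work for pure-ASCII content, but it would corrupt any character outside that range, so passing the encoding to `open` is the safer route. If a file does start with a byte order mark, `encoding="utf-16"` would detect the byte order and drop the mark.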
