I'm working with XML files in Python. My goal is to extract information from specific fields in these files for later use. The files appear to be well-formed, since I can open them in a plain text editor and see their structure. They describe specific points in a Whole-Slide Image (WSI). Here is an example (I received these files as single-line XML, sorry...).
<?xml version="1.0"?><Annotations file="E:\CONTAJES\CONTAJE 10-03-21\21A3 A25 HE.tif"><SlideAnnotation Text="" Voice=""></SlideAnnotation><GridMap><Mag Factor="1"></Mag><Mag Factor="2"></Mag></GridMap><Counters><Counter name="SANO" r="0" g="255" b="0" diameter="20"><Points></Points></Counter><Counter name="ESCLEROSADO" r="255" g="0" b="0" diameter="20"><Points><Point X="21878" Y="5128"/><Point X="32283" Y="5168"/><Point X="32478" Y="4913"/><Point X="37093" Y="9228"/><Point X="37393" Y="7068"/><Point X="43778" Y="7683"/><Point X="32388" Y="17453"/><Point X="27172" Y="36266"/><Point X="28758" Y="37858"/></Points></Counter><Counter name="SEMILUNAS" r="0" g="0" b="255" diameter="20"><Points></Points></Counter><Counter name="HIPERCELULAR MES" r="255" g="128" b="0" diameter="20"><Points></Points></Counter><Counter name="ENDOCAPILAR" r="255" g="255" b="128" diameter="20"><Points></Points></Counter><Counter name="ISQUEMICO" r="0" g="128" b="0" diameter="20"><Points></Points></Counter><Counter name="GSSF" r="128" g="0" b="255" diameter="20"><Points></Points></Counter><Counter name="INCOMPLETO" r="0" g="0" b="0" diameter="20"><Points></Points></Counter><Counter name="GNMP" r="0" g="128" b="192" diameter="20"><Points></Points></Counter><Counter name="MIXTO" r="128" g="128" b="128" diameter="20"><Points></Points></Counter><Counter name="MEMBRANOSO" r="128" g="64" b="64" diameter="20"><Points></Points></Counter></Counters></Annotations>
I first noticed the problem when I tried to open this file in the PyCharm IDE (v2020.2.3): it shows the same content, but with a NUL character after every character:
<NUL?NULxNULmNULlNUL NULvNULeNULrNULsNULiNULoNULn [...]
I've tried to open the file in Python as follows (using bs4 and lxml packages):
from bs4 import BeautifulSoup
file = 'sample.xml'
with open(file, 'r') as f:
    data = f.read()
Bs_data = BeautifulSoup(data, 'xml')
If I print data, I get the following output:
'<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00= [...]'
Here \x00 is the NUL character mentioned above. Bs_data contains only the XML declaration:
<?xml version="1.0" encoding="utf-8"?>
The encoding attribute doesn't exist in the original file.
Could this problem be related to the file's encoding? How can I read the file correctly in Python to collect its data?
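To test the encoding hypothesis, I could inspect the raw bytes instead of the decoded text. The following is only a sketch: guess_utf16_codec is a hypothetical helper name, and it assumes the start of the file holds only ASCII characters (true for an XML declaration), so that a NUL byte in every second position suggests UTF-16. The sample bytes are a stand-in for the real file, which would be read with open(file, 'rb').

```python
import codecs

def guess_utf16_codec(raw: bytes):
    """Return a codec name if the bytes look like UTF-16, else None (hypothetical helper)."""
    # A byte order mark settles it: the generic "utf-16" codec reads and strips the BOM.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # No BOM: look at where the NUL bytes fall in the first 64 bytes.
    head = raw[:64]
    even, odd = head[0::2], head[1::2]
    if odd and odd.count(0) == len(odd) and 0 not in even:
        return "utf-16-le"  # ASCII byte first, NUL second: little-endian
    if even and even.count(0) == len(even) and 0 not in odd:
        return "utf-16-be"  # NUL first, ASCII byte second: big-endian
    return None

# Stand-in for the real file contents (assumed to be UTF-16-LE without a BOM).
sample = '<?xml version="1.0"?><Annotations></Annotations>'.encode("utf-16-le")
codec = guess_utf16_codec(sample)
text = sample.decode(codec)
print(codec)
print(text)
```

If this reports a UTF-16 codec for the real file, the NUL characters would simply be the second byte of each two-byte character, misread by the default single-byte encoding.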
For now, as I need a solution ASAP, I'm simply stripping the NUL characters from the data string as follows:
[...]
data = f.read()
data = data.replace("\x00", '')
[...]
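If the file does turn out to be UTF-16, a cleaner alternative to stripping the NUL characters might be to pass the encoding to open(). The sketch below rests on assumptions: it guesses UTF-16-LE without a BOM, it writes a shortened stand-in for my file to a temporary path so that it runs on its own, and it uses the standard library's xml.etree.ElementTree instead of bs4 to keep it dependency-free.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Shortened stand-in for the real file, written as UTF-16-LE without a BOM (assumed encoding).
xml_text = (
    '<?xml version="1.0"?><Annotations><Counters>'
    '<Counter name="SANO"><Points></Points></Counter>'
    '<Counter name="ESCLEROSADO"><Points>'
    '<Point X="21878" Y="5128"/><Point X="32283" Y="5168"/>'
    '</Points></Counter></Counters></Annotations>'
)
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(xml_text.encode("utf-16-le"))
    path = tmp.name

try:
    # Decoding with the matching codec means no NUL characters reach the parser.
    with open(path, "r", encoding="utf-16-le") as f:
        data = f.read()
finally:
    os.remove(path)

root = ET.fromstring(data)

# Map each counter name to its list of (X, Y) integer coordinates.
points = {
    counter.get("name"): [(int(p.get("X")), int(p.get("Y"))) for p in counter.iter("Point")]
    for counter in root.iter("Counter")
}
print(points)
```

Unlike the replace() workaround, this would also keep any non-ASCII characters intact, since their second byte is not NUL.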
But I really want to understand the root of the problem.
Thanks in advance for your contributions!