I have an XML file that starts like this:

'''some non ascii character'''
<b:FatturaElettronica xmlns:b="#">
  <FatturaElettronicaHeader>
    <DatiTrasmissione>
      <IdTrasmittente>
        <IdPaese>IT</IdPaese>

I need to remove everything up to

<FatturaElettronicaHeader>

My current code is:

import xml.etree.ElementTree as ET
import xml.etree.ElementTree as ETree
from lxml import etree

parser = etree.XMLParser(encoding='utf-8', recover=True, remove_comments=True, resolve_entities=False)
tree = ETree.parse('test.xml', parser)

root = tree.getroot()

print etree.tostring(root)

and it gives me:

Traceback (most recent call last):
  File "xml2.py", line 14, in <module>
    print etree.tostring(root)
  File "src/lxml/etree.pyx", line 3350, in lxml.etree.tostring
TypeError: Type 'NoneType' cannot be serialized.

Without the first part of the XML file, it works.

Thank you.

Tom Blodget
JuConte
  • Wouldn't it make more sense to first look why you have a "corrupt" XML file. It is usually safer to solve the problems at the root level, not try to circumvent it, since that could mean something is broken on the generator, and thus will eventually result in wrong files. – Willem Van Onsem Jun 15 '19 at 15:56
  • The original file is a .xml.p7m file, and it always causes problems during conversion. I need to ignore the part of the file I don't need. – JuConte Jun 15 '19 at 15:57
  • Possible duplicate of [extract signed data from pkcs7 in python](https://stackoverflow.com/questions/52344287/extract-signed-data-from-pkcs7-in-python) – Tom Blodget Jun 15 '19 at 16:39
  • BTW—XML doesn't have anything to do with ASCII. All XML characters are Unicode. – Tom Blodget Jun 15 '19 at 16:41
  • If it starts with `'''` then it is not an XML file, so your question contains a basic contradiction. – Michael Kay Jun 15 '19 at 21:23
  • Please update the post with a valid XML. – balderman Jun 16 '19 at 07:53

1 Answer

You could use the `str.find()` method to locate the first angle bracket (`<`) and parse the file from that position onwards.

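import xml.etree.ElementTree as ET

# Read the whole file, including whatever precedes the XML
with open('...XMLFILE.xml', 'r') as f:
    filestring = f.read()

# Index of the first '<', i.e. where the XML starts
XML_start = filestring.find('<')
print(XML_start) #gives 31

# Parse only the text from the first '<' onwards; fromstring() returns the root element
root = ET.fromstring(filestring[XML_start:])

for elem in root.iter():
    print(elem.tag) #gives {#}FatturaElettronica, FatturaElettronicaHeader, ...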

The XML itself also has to be well-formed, i.e. every element that is opened must be closed:

'''some non ascii character'''
<b:FatturaElettronica xmlns:b="#">
  <FatturaElettronicaHeader>
    <DatiTrasmissione>
      <IdTrasmittente>
        <IdPaese>IT</IdPaese>
      </IdTrasmittente>
    </DatiTrasmissione>
  </FatturaElettronicaHeader>
</b:FatturaElettronica>
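
If the file really comes from a `.xml.p7m` container, as mentioned in the comments, there may be binary data after the XML as well as before it, and reading the file in text mode can fail on bytes that are not valid text. A minimal sketch of a byte-based variant follows. It assumes the XML sits in one contiguous piece inside the file and that the root element's local name is `FatturaElettronica`. The helper name `extract_xml` and the junk bytes in the sample are made up for illustration. Extracting the signed payload properly, as the linked duplicate suggests, is likely the more reliable route.

```python
import xml.etree.ElementTree as ET


def extract_xml(raw, root_name=b"FatturaElettronica"):
    """Return only the bytes spanning the root element (hypothetical helper)."""
    # First occurrence of the name: assumed to be inside the opening root tag,
    # e.g. "<b:FatturaElettronica ...".
    first_name = raw.find(root_name)
    # Last occurrence of the name: assumed to be inside the closing root tag.
    last_name = raw.rfind(root_name)
    if first_name == -1 or first_name == last_name:
        raise ValueError("could not locate a complete root element")
    # Walk back to the '<' that opens the root tag (this skips any prefix such as "b:").
    start = raw.rfind(b"<", 0, first_name)
    # Walk forward to the '>' that ends the closing root tag.
    end = raw.find(b">", last_name)
    if start == -1 or end == -1:
        raise ValueError("could not locate a complete root element")
    return raw[start:end + 1]


# Sample data: the XML from above, wrapped in invented leading and trailing bytes.
sample_xml = (
    b'<b:FatturaElettronica xmlns:b="#">'
    b"<FatturaElettronicaHeader><DatiTrasmissione><IdTrasmittente>"
    b"<IdPaese>IT</IdPaese>"
    b"</IdTrasmittente></DatiTrasmissione></FatturaElettronicaHeader>"
    b"</b:FatturaElettronica>"
)
raw = b"\x30\x82\x04<\xff junk" + sample_xml + b"\x00\xa0 trailing"

root = ET.fromstring(extract_xml(raw))
print(root.tag)  # {#}FatturaElettronica
```

For a real file, `raw` would come from `open(path, 'rb').read()`. Searching for the root element's name instead of the first `<` avoids starting at a stray `<` byte in the leading data.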
Mig B