XML parsing in Python 3

Question

I have a xml file like this which is a series in the following DF

 userid | fid | response
 -----------------------
 1125 | 58940 | xml1
 3344 | 47839 | xml2
 3455 | 12335 | xml3

The response column contains xml files like this

HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Expires: 0
Buffer: false
Pragma: No-cache
Cache-Control: no-cache
Server: IBM_CICS_Transaction_Server/4.1.0(zOS)
Connection: close
Content-Type: text/html
Content-Length: 33842
Date: Sat, 02 Aug 2014 09:27:02 GMT

<?xml version="1.0" encoding="UTF-8"?><creditBureau xmlns="http://www.transunion.com/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><document>response</document><version>2.9</version><transactionControl><userRefNumber>Credit Report Example</userRefNumber>

This is just a part of the entire document. I have to parse this big xml and convert it into json. The first problem I am having is to parse this file. My current code looks like this:

 import pandas as pd
 import re

 raw_data = pd.read_csv('C:\\Users\\Desktop\\xml.csv', sep = '|')
 df = pd.DataFrame(raw_data, columns = ['userid', 'fid', 'response'])
 file = open("testfile.txt", "w")
 file.write(df.loc[0][2])
 file.close()

 #Adding Root Element
 with open("testfile.txt") as f:
     file = f.read()
     file_list = file.split('\n')
 file_list[12] = '<root>'
 file_list.append('</root>')
 start = file_list.index('<root>')
 new_list = file_list[start:]

 #Converting to String
 str1 = ''.join(new_list)
 f = open("tocng.xml","w")
 f.write(str1)

 #parsing xml
 import xml.etree.ElementTree as ET
 tree = ET.parse('tocng.xml')
 ### Gives an error:XML or text declaration not at start of entity: line 1, column 6

I don't understand what is the problem here.

Already answers on Stackoverflow : ================ https://stackoverflow.com/questions/36020697/xml-or-text-declaration-not-at-start-of-entity-line-2-column-0-when-calling ===================================== https://www.drupal.org/project/services/issues/1452858 — Anup Yadav, Dec 13 '17 at 06:21
@AnupYadav I have already checked it. The solution does not work. It still gave me the same error. — Karan Gupta, Dec 13 '17 at 06:23
You you writing all the HTML headers to the `testfile.txt` file? Why are there HTTP headers in the first place? Is there *really* a CSV file that contains full HTTP responses in the third column? because that's a highly unusual format choice that a programmer normally would not pick. Do you get that from an external source or are you creating it yourself? If you are creating it yourself, why? So many questions... — Tomalak, Dec 13 '17 at 07:52
Your edited single line xml is missng a closing tag ``. I will add an answer based on the xml with it — michael_heath, Dec 13 '17 at 10:40
Yeah because it’s not the complete file. The file ends with the “credit bureau” tag. — Karan Gupta, Dec 13 '17 at 10:50
It looks like you are trying to parse a file, tocng.xml, that is not a well-formed XML file. If that file contains HTTP headers and/or whitespace before the XML declaration, then the "XML or text declaration not at start of entity" error is not surprising. — mzjn, Dec 13 '17 at 10:56
@Karan Gupta, if the xml ends with `` then may need to keep the opening `` tag. I did not know this so the "" may need to be replaced with a single " in that tag to mkae it valid. — michael_heath, Dec 13 '17 at 11:00
@michael_heath `` is like the root. The problem we have come down to is to add a root between ` ` and ``. — Karan Gupta, Dec 13 '17 at 11:02
@Tomalak I am just given more than 2mil files like this.I have no clue where they come from. I just have to put all in mongodb. What I am trying to do here is parse the `whateverFormat.xml` and convert it to json. — Karan Gupta, Dec 13 '17 at 11:05
@mzjn Well, I am new to python. Please, if you have solution, kindly post it. — Karan Gupta, Dec 13 '17 at 11:06
I don't have a solution, but it seems like you need to solve this issue first: https://stackoverflow.com/q/47669501/407651. — mzjn, Dec 13 '17 at 11:30

michael_heath · Accepted Answer · 2017-12-14T08:57:15.680

Initial version

import pandas as pd
import re

raw_data = pd.read_csv('C:\\Users\\Desktop\\xml.csv', sep = '|')
df = pd.DataFrame(raw_data, columns = ['userid', 'fid', 'response'])
file = open("testfile.txt", "w")
file.write(df.loc[0][2])
file.close()

#Adding Root Element
with open("testfile.txt") as f:
    file = f.read()
    file_list = file.split('\n')
file_list[13] = '<root>' ### 12 to 13 to overwrite deformed <creditBureau ...>
file_list.append('</root>')
start = file_list.index('<root>')
new_list = file_list[start:]

#Converting to String
str1 = ''.join(new_list)
f = open("tocng.xml","w")
f.write(str1)
f.close() ### close file handle so ET can read it

#parsing xml
import xml.etree.ElementTree as ET
tree = ET.parse('tocng.xml')

Couple of issues:

You left a declaration element after <root> tag. Removed declarations.
Declarations have "" which deforms them. Works without the declarations.
Write to tocng.xml and you do not close the file handle. Makes ET reading fail.

If you want to use the regex I offered in a previous post then try this as it will remove the header without any line count needed. <root> then stats from index 1.

import pandas as pd
import re

raw_data = pd.read_csv('C:\\Users\\Desktop\\xml.csv', sep = '|')
df = pd.DataFrame(raw_data, columns = ['userid', 'fid', 'response'])
file = open("testfile.txt", "w")
file.write(df.loc[0][2])
file.close()

#Adding Root Element
with open("testfile.txt") as f:
    file = f.read()
    file = re.sub(r'\A.*(<\?xml.*)\Z', r'\1', file, flags=re.S)
    file_list = file.split('\n')
file_list[1] = '<root>'
file_list.append('</root>')
start = file_list.index('<root>')
new_list = file_list[start:]

#Converting to String
str1 = ''.join(new_list)
f = open("tocng.xml","w")
f.write(str1)
f.close() ### close file handle so ET can read it

#parsing xml
import xml.etree.ElementTree as ET
tree = ET.parse('tocng.xml')

Based on new single line xml

More variations in the xml files may need adjustment to the code.

import pandas as pd
import re

raw_data = pd.read_csv('C:\\Users\\Desktop\\xml.csv', sep = '|')
df = pd.DataFrame(raw_data, columns = ['userid', 'fid', 'response'])
file = open("testfile.txt", "w")
file.write(df.loc[0][2])
file.close()

#Adding Root Element
with open("testfile.txt") as f:
    file = f.read()
    # Replace up to <?xml tag.
    file = re.sub(r'\A.*(<\?xml.*)\Z', r'\1', file, flags=re.S)
    # Strip file and add \n at each instance of >.
    file = file.strip()
    file = file.replace('>', '>\n')
    # Split file and make a list with no empty items.
    file_list = file.split('\n')
    file_list = [item for item in file_list if item != '']
    # Remove known xml declarations.
    if file_list[0][:5] == '<?xml':
        del file_list[0]
    if file_list[0][:13] == '<creditBureau':
        del file_list[0]
    # Add root tags.
    file_list.insert(0, '<root>')
    file_list.append('</root>')

#Converting to String
str1 = ''.join(file_list)
print(str1) ## See output in my answer
with open("tocng.xml","w") as f:
    f.write(str1)

#parsing xml
import xml.etree.ElementTree as ET
tree = ET.parse('tocng.xml')

Output:

<root><document>response</document><version>2.9</version><transactionControl><userRefNumber>Credit Report Example</userRefNumber></transactionControl></root>

Edit: If <creditBureau...> is needed, then remove or comment these lines:

    if file_list[0][:13] == '<creditBureau':
        del file_list[0]

The last xml posted does not look malformed with the initial tags so no changes done to handle it.

Regex customization of xml string

Using regex with no use of lists to handle the xml string as the xml could be multiline or single line.

In the customize_xml function, regex comments show group numbers and the pattern modes which you can pass as arguments to the customize_xml function. Valid mode arguments are one of None, -1, 0, 1, 2, 3, 4.

import pandas as pd
import re


def customize_xml(content, mode=0):
    '''Customizes xml tags in the content and may insert a <root> tag.'''

    # No modification.
    if mode in (-1, None):
        return content

    # Select a pattern (mode) that modifies the whole xml.
    pattern = (r'\2\3\4',                # 0. <cB>...</cB>
               r'<root>\2\3\4</root>',   # 1. <root><cB>...</cB><root>
               r'\1<root>\2\3\4</root>', # 2. <?xml?><root><cB>...</cB><root>
               r'<root>\3</root>',       # 3. <root>...<root>
               r'\1<root>\3</root>',     # 4. <?xml?><root>...<root>
               )

    # Groups are marked as \1 \2 ... to use for pattern above.
    content = re.sub(r'(<\?xml.+?\?>)'      # \1
                      '(<creditBureau.*?>)' # \2
                      '(.+?)'               # \3
                      '(</creditBureau>)'   # \4
                     , pattern[mode], content, flags=re.S)

    return content


raw_data = pd.read_csv('C:\\Users\\Desktop\\xml.csv', sep = '|')
df = pd.DataFrame(raw_data, columns = ['userid', 'fid', 'response'])
with open("testfile.txt", "w") as f:
    f.write(df.loc[0][2])

with open("testfile.txt") as f:
    file = f.read()

# Remove characters up to <?xml tag.
file = re.sub(r'\A.*(<\?xml.*)\Z', r'\1', file, flags=re.S)

# Make xml one single line if not already.
file = file.replace('\n', '')

file = customize_xml(file, 3)

# Write customized xml.
with open("tocng.xml", "w") as f:
    f.write(file)

# Parsing xml.
import xml.etree.ElementTree as ET
tree = ET.parse('tocng.xml')

# Print pretty xml from xml string.
from xml.dom import minidom
pretty_xml = minidom.parseString(file).toprettyxml(indent="    ")
print(pretty_xml)

Added a pretty print at end. Optional, just for reviewing the result.

`with open("testfile.txt") as f: file = f.read() file_list = file.split('\n') file_list[13] = '' file_list.append('') start = file_list.index('') new_list = file_list[start:] new_list` output: `['', '']` if file_llist[13] is made the root then see the output of new_list — Karan Gupta, Dec 13 '17 at 09:06
@Karan Gupta, new_list prints: `['', 'response', '2.9', 'Credit Report Example', 'Z', '']` — michael_heath, Dec 13 '17 at 09:35
Its surprising because with me it prints `['', '']`. So, I did `file_list.insert(13,'')`. — Karan Gupta, Dec 13 '17 at 09:36
if I use that, I get the `` item after ``. If you get that item, it can deform the xml as it should be above root and should not have doubled double quotes. — michael_heath, Dec 13 '17 at 09:46
Actually the xml I posted in the question is not the actual one. I inserted new lines to make it readable. I am editing the question to give you the exact xml. — Karan Gupta, Dec 13 '17 at 09:49
I will be able to check this solution day after only. Will post soon. Thanks. — Karan Gupta, Dec 13 '17 at 11:55
@Karan Gupta, added *Regex customization of xml string* which may be worth testing. Does not rely on splitting lines and using lists which may be unreliable for this task. The tag `creditBureau` has some namespace attributes so I am not sure if inserting tag `root` before it is good, so I set mode of `customize_xml()` to 3 which outputs lean xml without tag `creditBureau`. Try it. — michael_heath, Dec 14 '17 at 09:09
In the list solution, you deleted the first `` but not the last ``.To remove that, I first check the index and then remove that too? Same way as you deleted the first one? You are right, I don't need the `creditbureau` tag as we are adding a root. The last element in the list is `",` which also needs to be removed. — Karan Gupta, Dec 15 '17 at 07:00
I checked the list method and it works perfectly. Even the regex solution will work. The regex solution is very nice. I couldn't have thought about it. The function is great. Now, that I have the clean xml, I need to add it to the dataframe which was initially made and then convert it into json. So that the dependency is in tact, as in, the userid should have its corresponding response file. — Karan Gupta, Dec 15 '17 at 09:10

XML parsing in Python 3

1 Answers1

Initial version

Based on new single line xml

Regex customization of xml string