
I have an XML file like this:

  "HTTP/1.1 100 Continue
   HTTP/1.1 200 OK
   Expires: 0
   Buffer: false 
   Pragma: No-cache
   Cache-Control: no-cache
   Server: Transaction_Server/4.1.0(zOS)
   Connection: close
   Content-Type: text/html
   Content-Length: 33842
   Date: Sat, 02 Aug 2014 09:27:02 GMT

 <?xml version=""1.0"" encoding=""UTF-8""?>
 <creditBureau xmlns=""http://www.transunion.com/namespace"" xmlns:xsi=""http://www.w3.org/2001/XMLSchema-instance"">

 <document>response</document>
 <version>2.9</version>
 <transactionControl><userRefNumber>Credit Report Example</userRefNumber>
 <subscriber><industryCode>Z</industryCode></subscriber></transactionControl>

This is just part of the entire document. I want to convert it to JSON. The problem is how to skip or delete the HTTP header part and start parsing from the real XML, that is, from the <document> tag onward.

There are more than a million such files, so I can't do it manually. How can I do it? Any help is appreciated.

mzjn
Karan Gupta

2 Answers


You could use a regex to select only the XML part, something like `/<document>(.*)/gs` or `/"">(.*)/gs`.

But how are you fetching that website? This looks similar to output I've been getting from curl, and you should be able to get only the body out of curl.

Then use a library to convert the XML to JSON.

For that part, see something like "Converting XML to JSON using Python?".
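A minimal sketch of both steps using only the standard library. It rests on two assumptions that the sample suggests but does not confirm: the doubled quotes (`""`) are CSV-style escaping of a single `"`, and the complete file ends with the closing tag of the root element, possibly followed by a stray quote. The function names `extract_xml`, `element_to_dict` and `raw_to_json` are made up for this illustration, and attributes are ignored in the conversion.

```python
import json
import xml.etree.ElementTree as ET


def extract_xml(raw):
    """Return the text from the first '<?xml' up to the last '>'."""
    start = raw.find('<?xml')
    end = raw.rfind('>')
    if start == -1 or end < start:
        raise ValueError('no XML declaration found')
    # Assumption: '""' is CSV-style escaping of '"', as the sample suggests.
    return raw[start:end + 1].replace('""', '"')


def element_to_dict(elem):
    """Recursively convert an element to (tag, value); attributes are ignored."""
    tag = elem.tag.split('}')[-1]  # drop the '{namespace}' prefix, if any
    children = list(elem)
    if not children:
        return tag, (elem.text or '').strip()
    result = {}
    for child in children:
        key, value = element_to_dict(child)
        if key in result:
            # Repeated tags are collected into a list.
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
    return tag, result


def raw_to_json(raw):
    """Strip the HTTP header from raw text and return the XML body as JSON."""
    xml_text = extract_xml(raw)
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xml_text.encode('utf-8'))
    tag, value = element_to_dict(root)
    return json.dumps({tag: value})
```

Calling `raw_to_json(open(path, encoding='utf-8').read())` on one file would return a JSON string. If the doubled quotes are not an escaping artifact in the real data, the `replace` call should be removed.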

P.S. (I know this would be better as a comment, but I don't have enough reputation, so I'm putting it here.)

darthzejdr
  • I am not fetching the website. I get this data from TransUnion. So what you're saying is that I read it as a text file and use a regex to select the XML part? – Karan Gupta Dec 06 '17 at 08:13
  • That's what I would do. But there might be a better way. I'm not that good with Python. – darthzejdr Dec 06 '17 at 08:39

You can read each file and remove the unwanted header using one of the two approaches shown below.

import re

file = '''\
"HTTP/1.1 100 Continue
 HTTP/1.1 200 OK
 Expires: 0
 Buffer: false
 Pragma: No-cache
 Cache-Control: no-cache
 Server: Transaction_Server/4.1.0(zOS)
 Connection: close
 Content-Type: text/html
 Content-Length: 33842
 Date: Sat, 02 Aug 2014 09:27:02 GMT

<?xml version=""1.0"" encoding=""UTF-8""?>
<creditBureau xmlns=""http://www.transunion.com/namespace"" xmlns:xsi=""http://www.w3.org/2001/XMLSchema-instance"">

<document>response</document>
<version>2.9</version>
<transactionControl><userRefNumber>Credit Report Example</userRefNumber>
<subscriber><industryCode>Z</industryCode></subscriber></transactionControl>'''

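# List concept: keep everything from the line that starts with '<?xml'.
# lstrip() tolerates leading spaces before the declaration.
file_list = file.split('\n')
start = next(i for i, line in enumerate(file_list) if line.lstrip().startswith('<?xml'))
new_list = file_list[start:]
print('joined from list:\n', '\n'.join(new_list), sep='')

# Regexp concept: drop everything before the last '<?xml' in the text.
new_string = re.sub(r'\A.*(<\?xml.*)\Z', r'\1', file, flags=re.S)
print('regexp:\n', new_string, sep='')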

The regexp might be quicker, though you have plenty of files to test that with.
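To check the speed claim rather than guess, here is a rough timing sketch with `timeit`. It compares the two concepts above with a third option, a plain `str.find` slice, which is not part of the original answer. The helper names and the tiny sample are made up for illustration, and real timings will depend on your file sizes, so no result is claimed here.

```python
import re
import timeit

# Small stand-in for one file; real files are far larger.
sample = (
    'HTTP/1.1 200 OK\n'
    'Content-Type: text/html\n'
    '\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<document>response</document>\n'
)


def strip_with_list(text):
    """List concept: split into lines and keep from the '<?xml' line on."""
    lines = text.split('\n')
    start = next(i for i, line in enumerate(lines)
                 if line.lstrip().startswith('<?xml'))
    return '\n'.join(lines[start:])


def strip_with_regex(text):
    """Regexp concept from the answer."""
    return re.sub(r'\A.*(<\?xml.*)\Z', r'\1', text, flags=re.S)


def strip_with_find(text):
    """Alternative (an assumption, not from the answer): slice at str.find."""
    start = text.find('<?xml')
    return text if start == -1 else text[start:]


for func in (strip_with_list, strip_with_regex, strip_with_find):
    seconds = timeit.timeit(lambda: func(sample), number=1000)
    print(f'{func.__name__}: {seconds:.4f} s for 1000 runs')
```

Run it against a few real files before choosing; with a million files, disk I/O may well dominate whichever string method you pick.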

Edit:

Use it like this on test.xml:

import re

with open('test.xml') as r:
    file = r.read()

new_string = re.sub(r'\A.*(<\?xml.*)\Z', r'\1', file, flags=re.S)

print(new_string)

Edit:

Another example, showing bulk overwriting of XML files. Always test on a small set of copies first before using it on many files. A small test works fine for me.

import glob, re

for file in glob.iglob('*.xml'):
    with open(file) as r:
        current_string = r.read()

    new_string = re.sub(r'\A.*(<\?xml.*)\Z', r'\1', current_string, flags=re.S)

    with open(file, 'w') as w:
        w.write(new_string)

Specifying the codec for reading and writing may be needed.
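A sketch of a more cautious bulk variant: it passes the codec explicitly and writes cleaned copies to a separate directory instead of overwriting the originals. The UTF-8 default is an assumption taken from the XML declaration in the sample, and the function names and directory layout are hypothetical.

```python
from pathlib import Path


def strip_header(text):
    """Return text from the first '<?xml' on; unchanged if none is found."""
    start = text.find('<?xml')
    return text if start == -1 else text[start:]


def strip_headers_in_directory(src_dir, dst_dir, encoding='utf-8'):
    """Write header-free copies of src_dir/*.xml into dst_dir.

    The encoding default is an assumption; change it if the files
    use a different codec. Returns the number of files written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob('*.xml')):
        text = path.read_text(encoding=encoding)
        (dst / path.name).write_text(strip_header(text), encoding=encoding)
        count += 1
    return count
```

For example, `strip_headers_in_directory('raw', 'clean')` would leave everything in `raw` untouched, so a mistake in the pattern costs nothing but a rerun.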

michael_heath
  • `AttributeError: '_io.TextIOWrapper' object has no attribute 'split'` comes up in the list concept :( – Karan Gupta Dec 06 '17 at 12:07
  • `TypeError: expected string or bytes-like object` in the regexp concept. I am tired of converting between types... Help! – Karan Gupta Dec 06 '17 at 12:08
  • 1st error: [another posted answer](https://stackoverflow.com/questions/17569679/python-attributeerror-io-textiowrapper-object-has-no-attribute-split#17570045) – michael_heath Dec 06 '17 at 12:19