My interpretation of your needs is that you want to be able to parse the partial file and build up the structure of the document as you go. I've made some assumptions based on the file you uploaded:
Fundamentally you want to be parsing collections of things which have similar properties - I'm inferring this from the way you presented your desired output as a table with rows containing the values.
You expect these collections of things to have the same number of values.
You need to be able to parse partial files.
You don't care about the attributes of elements, just their contents.
I'm using xml.sax, as it copes with arbitrarily large files and doesn't need to read the whole file into memory. Note that the strategy I'm following here doesn't actually scale that well, as I'm storing all the element contents in memory to build the dataframe, but you could just as well write out the paths and contents as you go.
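As a rough sketch of that last idea (not something I've run against your file), a handler can write each path and its text straight to a CSV writer instead of keeping them in memory. The class name PathWriter and the tiny sample document are made up for illustration.

```python
import csv
import io
import xml.sax


class PathWriter(xml.sax.ContentHandler):
    """Hypothetical handler: writes one (path, text) row per non-empty element."""

    def __init__(self, out):
        super().__init__()
        self.writer = csv.writer(out)
        self.path = []   # stack of currently open element names
        self.text = ''

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = ''

    def characters(self, content):
        self.text += content

    def endElement(self, name):
        stripped = self.text.strip()
        if stripped:
            # Emit the row immediately, so nothing accumulates in memory
            self.writer.writerow(['/'.join(self.path), stripped])
        self.path.pop()
        self.text = ''


# Invented document, only loosely modelled on the sample file
sample = (b'<Items><Item><Main><Platform>iTunes</Platform>'
          b'<Type>TVEpisode</Type></Main></Item></Items>')
out = io.StringIO()
xml.sax.parseString(sample, PathWriter(out))
rows = out.getvalue().splitlines()
print(rows)
```

In real use you would pass an open file instead of the StringIO, and memory use then no longer grows with the number of elements.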
In the sample file there is a problem with having one row per Item, since each Item can contain multiple Genre tags and also multiple Product tags. I've handled the repeated Genre tags by joining their values with commas. This relies on the Genre tags appearing consecutively. It is not at all clear how the Product relationships can be handled in a single table.
import xml.sax
from collections import defaultdict


class StructureParser(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.text = ''
        self.path = []                     # stack of currently open element names
        self.datalist = defaultdict(list)  # path -> list of text values
        self.previouspath = ''

    def startElement(self, name, attrs):
        self.path.append(name)

    def endElement(self, name):
        strippedtext = self.text.strip()
        path = '/'.join(self.path)
        if strippedtext != '':
            if path == self.previouspath:
                # This handles the "Genre" tags in the sample file
                self.datalist[path][-1] += f',{strippedtext}'
            else:
                self.datalist[path].append(strippedtext)
        self.path.pop()
        self.text = ''
        self.previouspath = path

    def characters(self, content):
        # May be called several times per element, so accumulate
        self.text += content
You'd use it like this:

parser = StructureParser()
try:
    xml.sax.parse('uyalicihow.xml', parser)
except xml.sax.SAXParseException:
    print('File probably ended too soon')
This will read the example file just fine.
Once this has run, and probably printed "File probably ended too soon", you have the parsed contents in parser.datalist.
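To illustrate why this works on a partial file, here is a small self-contained sketch: the SAX callbacks fire for everything that was read before the parser hits the truncation, and only then is the exception raised. The NameCollector class and the truncated document are invented for this example.

```python
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Hypothetical helper: records the text of every <Name> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self.text = ''

    def startElement(self, name, attrs):
        self.text = ''

    def characters(self, content):
        self.text += content

    def endElement(self, name):
        if name == 'Name':
            self.names.append(self.text.strip())


# Made-up document, cut off in the middle of the third item
truncated = (b'<Items><Item><Name>A</Name></Item>'
             b'<Item><Name>B</Name></Item><Item><Na')

collector = NameCollector()
ended_early = False
try:
    xml.sax.parseString(truncated, collector)
except xml.sax.SAXParseException:
    ended_early = True

# The two complete items were collected before the error was raised
print(ended_early, collector.names)
```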
You obviously want to have just the parts which read successfully, so you can figure out the shortest list and build a DataFrame with just those paths:
import pandas as pd
smallest_items = min(len(e) for e in parser.datalist.values())
df = pd.DataFrame({key: value for key, value in parser.datalist.items() if len(value) == smallest_items})
This gives something similar to your desired output:
  Items/Item/Main/Platform Items/Item/Main/PlatformID Items/Item/Main/Type
0                   iTunes                  353736518            TVEpisode
1                   iTunes                  495275084            TVEpisode
The columns for the test file which are matched here are:

>>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
       'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
       'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
       'Items/Item/Info/HighestResolution',
       'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
       'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
       'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
       'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
       'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
       'Items/Item/Products/Product/URL'],
      dtype='object')
Based on your comments, it appears that it is more important to you to have all the elements represented, perhaps just as a preview. In that case you can use only the first elements from each list. Note that in this case the Product entries won't line up with the Item entries.
df = pd.DataFrame({key: value[:smallest_items] for key, value in parser.datalist.items()})
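To make the difference between the two selections concrete, here is a toy comparison on a hand-written stand-in for parser.datalist. The values are invented, and I'm assuming each Item has two offers, so the price list is twice as long as the others.

```python
# Toy stand-in for parser.datalist; the values are invented
datalist = {
    'Items/Item/Main/Platform': ['iTunes', 'iTunes'],
    'Items/Item/Main/Type': ['TVEpisode', 'TVEpisode'],
    # Assumption: two offers per Item, so this list is longer
    'Items/Item/Products/Product/Offers/Offer/Price': ['1.99', '2.99', '1.99', '2.99'],
}

smallest_items = min(len(values) for values in datalist.values())

# Selection 1: keep only the columns whose length equals the shortest one
matching = {key: value for key, value in datalist.items()
            if len(value) == smallest_items}

# Selection 2: keep every column, cut down to the shortest length
preview = {key: value[:smallest_items] for key, value in datalist.items()}

print(sorted(matching))
print(preview['Items/Item/Products/Product/Offers/Offer/Price'])
```

Either dict can be passed to pd.DataFrame. In the second one, both prices in the preview come from the first Item, which is exactly the mismatch mentioned above.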
Now we get all the paths:
>>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
       'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
       'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
       'Items/Item/Info/HighestResolution',
       'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
       'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
       'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
       'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
       'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
       'Items/Item/Products/Product/URL',
       'Items/Item/Products/Product/Offers/Offer/Price',
       'Items/Item/Products/Product/Offers/Offer/Currency'],
      dtype='object')