I have a large XML file (100 GB or more) which contains unordered records, something like this:
<page id = "1">content 1</page>
<page id = "2">...</page>
<page id = "2">...</page>
... many other pages ...
<page id = "1">content 2</page>
... many other pages ...
I have to access the data in read-only mode, but grouped by page:
<page id = "1">
content1
content2
...
</page>
<page id = "2">
...
...
</page>
...other pages...
Pages do not need to be ordered by id.
My current solution pre-processes the XML and, for each page (see the sketch after this list):
- open a file with a unique naming convention (e.g. "1.data" for page 1, "2.data", ...)
- append the content of the current page
- close the file
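
Roughly, the pre-processing looks like this (a simplified sketch, assuming the page elements sit under a single root element so that xml.etree.ElementTree.iterparse can stream the file without loading it all; "input.xml" is a placeholder name):

import xml.etree.ElementTree as ET

# Stream the file and append each fragment to its page's .data file.
for event, elem in ET.iterparse("input.xml"):
    if elem.tag == "page":
        page_id = elem.get("id")
        with open(page_id + ".data", "a", encoding="utf-8") as f:
            f.write((elem.text or "") + "\n")
        elem.clear()  # drop the element's content to keep memory flat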
The problem is that handling a large number of pages requires creating millions of files, which is of course not very good.
My question is whether it's possible to use some kind of archive file (like tar or zip) to serialize all my data. The advantage would be having only one big file containing all my data, which could be read sequentially; compression is not strictly required.
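
For example, something like this is what I have in mind with tarfile (just a sketch: the member naming "<id>/<n>" is my own invention so that all fragments of a page share a common prefix, and fragments() stands in for the streaming parse above):

import io
import tarfile

with tarfile.open("pages.tar", "w") as archive:
    counters = {}  # page id -> next fragment number for that page
    for page_id, content in fragments():  # placeholder generator
        n = counters.get(page_id, 0)
        counters[page_id] = n + 1
        data = content.encode("utf-8")
        # Each fragment becomes its own archive member.
        info = tarfile.TarInfo(name=page_id + "/" + str(n))
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))

Reading back, I could walk the archive once to index the members by page id and then pull each page's fragments together, but I don't know how well tarfile copes with millions of members in a 100 GB archive, which is really what I'm asking.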
I would prefer to avoid using a database, because my software should be standalone, and I would prefer to use Python.
Thanks,
Riccardo