
I have a large XML file (100 GB or more) that contains unordered records, something like this:

<page id = "1">content 1</page>
<page id = "2">...</page>
<page id = "2">...</page>
... many other pages ...
<page id = "1">content 2</page>
... many other pages ...

I have to access the data in read-only mode, but grouped by page:

<page id = "1">
    content1
    content2
    ...
</page>
<page id = "2">
    ...
    ...
</page>
...other pages...

The pages don't need to be ordered by id.

My current solution requires pre-processing the XML and, for each page:

  • open a file with a unique naming convention (e.g. "1.data" for page 1, "2.data", ...)
  • append the content of the current page
  • close the file
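
In code, that pre-processing step is roughly the following (a simplified sketch: it assumes all the <page> elements sit under a single root element with plain-text content, and streams the file with xml.etree.ElementTree.iterparse):

import xml.etree.ElementTree as ET

def split_into_files(xml_path):
    # Stream the XML so the whole 100 GB file never has to fit in memory.
    for event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "page":
            page_id = elem.get("id")
            # One file per page id: "1.data", "2.data", ...
            with open("%s.data" % page_id, "a") as out:
                out.write((elem.text or "") + "\n")
            elem.clear()  # free elements that have already been written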

The problem is that with a large number of pages this requires creating millions of files, which of course is not very good.

My question is whether it's possible to use some kind of archive file (like tar or zip) to serialize all my data. The advantage would be having only one big file containing all my data, which can be read sequentially; compression is not necessarily required.
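
For example, the tarfile module in the standard library could perhaps be used like this (only a sketch of the idea; the member naming and the grouping-on-read step are just illustrative):

import io
import tarfile

def append_fragment(archive_path, page_id, content):
    # Append one page fragment as a member named after its id.
    # Duplicate member names are allowed; fragments get grouped when reading.
    data = content.encode("utf-8")
    info = tarfile.TarInfo(name=str(page_id))
    info.size = len(data)
    with tarfile.open(archive_path, "a") as tar:  # 'a' appends, creating the file if missing
        tar.addfile(info, io.BytesIO(data))

def read_grouped(archive_path):
    # Read the archive sequentially and group fragments by page id.
    # (Groups everything in memory here, only to show the mechanics.)
    groups = {}
    with tarfile.open(archive_path, "r") as tar:
        for member in tar:
            text = tar.extractfile(member).read().decode("utf-8")
            groups.setdefault(member.name, []).append(text)
    return groups

In a real run I would keep the archive open while writing all the fragments instead of reopening it for every record.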

I would prefer to avoid using a database, because my software should be standalone, and I would prefer to use Python.

Thanks,

Riccardo

riccardo.tasso
  • A database is probably the logical way to do it. Look at [sqlite](http://docs.python.org/library/sqlite3.html), which is integrated with Python. – Thomas K Feb 18 '12 at 16:11
  • Probably I should use something like fseek: http://stackoverflow.com/questions/508983/how-to-overwrite-some-bytes-in-the-middle-of-a-file-with-python – riccardo.tasso Feb 20 '12 at 16:23
  • You'll end up trying to write a database in Python. You won't make it any faster or better than using a real database. Python comes with a simple database (sqlite) ready to use. – Thomas K Feb 20 '12 at 18:12

1 Answer


I suggest you use the sqlite3 module, which is in the Python standard library.
A single SQLite database file can hold up to 128 TB of data, which is far more than 200 GB.
It's much more convenient to manage all this data in one database file than in millions of files.
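
A minimal sketch of the idea (table and column names are just examples):

import sqlite3

conn = sqlite3.connect("pages.db")  # a single file holds everything
conn.execute("CREATE TABLE IF NOT EXISTS page (id TEXT, content TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS page_id_idx ON page (id)")

def add_fragment(page_id, content):
    # One row per fragment; commit in batches when loading a 100 GB dump.
    conn.execute("INSERT INTO page (id, content) VALUES (?, ?)", (page_id, content))

def read_page(page_id):
    # Group at read time by concatenating all fragments for one id.
    rows = conn.execute("SELECT content FROM page WHERE id = ?", (page_id,))
    return "".join(content for (content,) in rows)

Inserting one row per fragment and grouping with a SELECT at read time is usually much faster than updating a single row with string concatenation for every fragment.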

kev
  • Something like: create table Page(name text, content text); create index name_idx ON Page(name); for each cur_name, cur_content in xml: update Page set content = content || cur_content where name = cur_name; Do you think this can be as efficient as writing to files? – riccardo.tasso Feb 18 '12 at 18:11