
I noticed there is a "large_record_size" option with ZODB.DB, and I took it as a possibility of storing long texts in ZODB.

The first time I tried storing a corpus of texts (59.1 MB in total, 6000 texts, the longest being 82 KB) with the "large_record_size" option set to 16777216, I simply put everything into connection.root(). A warning was issued reporting the size of the root PersistentMapping and saying that it was probably a bad idea to store an object this large.

Then I tried using an OOBTree to store the same set of texts. No warning this time. The resulting database file was 59.2 MB, barely larger than the corpus itself. I tested the file by retrieving texts from it at random, and retrieval was fairly fast. Apparently everything was how I had wanted it. However, I am new to programming, and I don't think I understand enough to judge whether this is safe.

Is ZODB a decent solution for storing texts?

Any suggestion would be appreciated.

DingZh

1 Answer

The option is merely used to control when a warning is issued:

When data records are large, a warning is issued to try to prevent new users from shooting themselves in the foot.

>>> import transaction, ZODB, ZODB.tests.util
>>> db = ZODB.DB('t.fs', create=True)
>>> conn = db.open()
>>> conn.root.x = 'x'*(1<<24)
>>> ZODB.tests.util.assert_warning(UserWarning, transaction.commit,
...    "object you're saving is large.")
>>> db.close()

The large_record_size option sets the threshold; the default is 1<<24, or 16 MB.

Over this size, you should either use ZODB Blobs or split up the data into smaller persistent records, as changes to large homogeneous records will lead to huge churn when committed. See a previous answer of mine: when to commit data in ZODB.

The warning is issued for your PersistentMapping because it stores all keys and values in one record. What counts here is not the size of any individual text document but the combined size of (the pickles of) all the text documents, and that total is what triggers the warning.
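To make the size argument concrete, here is a rough sketch that uses only the standard `pickle` module, not ZODB itself. The corpus is made up, and ZODB's own records carry some extra class metadata, so the numbers are only indicative:

```python
import pickle

# Hypothetical corpus: 2000 distinct documents of 12,000 characters each.
texts = {f"doc-{i}": f"{i:05d} " * 2000 for i in range(2000)}

# A single mapping that holds every text pickles into one large record.
single_record_size = len(pickle.dumps(texts))

# Pickled one at a time, each text is a small record of its own.
largest_individual = max(len(pickle.dumps(text)) for text in texts.values())

threshold = 1 << 24  # 16 MB, the value used in the question

print(single_record_size > threshold)   # the combined record crosses the threshold
print(largest_individual > threshold)   # no single text comes close
```

No individual document is anywhere near 16 MB, yet the mapping as a whole is well past it, which is the situation the warning describes.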

Either store your text documents in the PersistentMapping as instances of a Persistent subclass (so that each value gets its own record in the ZODB), or use a BTrees.OOBTree.OOBTree object.

See Advanced ZODB for Python Programmers.

Martijn Pieters
  • Thank you for your very thorough and very instructive answer, Martijn. I tried blobs but gave up on them because retrieval performance was rather poor. I think I need to figure out how long is too long for a text document, to avoid "huge churn when committed" or any similar problem. – DingZh Oct 25 '13 at 19:27
  • @user2871934: create a subclass of `Persistent` for your text data then. `class TextDocument(Persistent):` and store instances of *that* in your mapping. – Martijn Pieters Oct 25 '13 at 19:32
  • @user2871934: Each text document then will have its own record in the ZODB and the `PersistentMapping` parent will only have to store keys and references to the value records. – Martijn Pieters Oct 25 '13 at 19:37