
I'm trying to find a database to function as a Python set. This is because my data is way too big to be stored in memory.

I tried SQLite, but I heard that it might have performance issues with more than ten gigabytes of data, so I'm considering CouchDB.

The problem is that CouchDB seems to work like a dict, not like a set.

Is there a database tool that functions as a Python set? That is, it just stores values and not key-value pairs?

(I have to code in Python so I'm interested in something that is easy to use with Python)

Edit:

I will store it as one giant set, not several small ones.

The Unfun Cat
  • If you're aiming to store single sets of more than 10 GB, you probably should not be using Python's built-in set. Also, if you have multiple sets and each one is relatively small, have you considered a flat file database? – loopbackbee Nov 20 '12 at 07:07
  • I will store it as one great set. I'm also hoping for something quick, as this will be performance critical. – The Unfun Cat Nov 20 '12 at 07:12
  • That complicates things. What kind of access patterns are you expecting? – loopbackbee Nov 20 '12 at 07:14
  • Insert one/lookup one alternating. After a while I might not insert any more if the value is already there. – The Unfun Cat Nov 20 '12 at 07:34

3 Answers

1

Redis can store Set data types:
http://redis.io/topics/data-types

It has a Python client.
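As a rough sketch of what this could look like, assuming the redis-py package is installed and a Redis server is running locally: the key name "seen" and the helper names below are made up for illustration, while SADD and SISMEMBER are the Redis set commands doing the work. Keep in mind that Redis holds its dataset in memory.

```python
SET_KEY = "seen"  # hypothetical key under which the one big set is stored


def connect(host="localhost", port=6379):
    """Open a connection with redis-py (assumed installed, server assumed running)."""
    import redis  # imported lazily so the helpers below work with any compatible client

    return redis.Redis(host=host, port=port)


def add_if_absent(client, value, key=SET_KEY):
    """Add value to the set; return True only if it was not already a member."""
    # SADD returns the number of members that were actually added.
    return client.sadd(key, value) == 1


def contains(client, value, key=SET_KEY):
    """Return True if value is a member of the set."""
    return bool(client.sismember(key, value))
```

With a live server, usage would be along the lines of `r = connect()` followed by `add_if_absent(r, "some value")` and `contains(r, "some value")`.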

jdi
  • http://redis.io/topics/faq : "I like Redis high level operations and features, but I don't like that it takes everything in memory and I can't have a dataset larger the memory. Plans to change this?" – The Unfun Cat Nov 20 '12 at 07:32
1

A key/value store acts like a dict, but that's pretty much how set is implemented anyway, according to the main answer to "How is set() implemented?". Why not just use a small dummy value and do your set operations on the keys?
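Here is a minimal sketch of the dummy-value idea, using the standard library's dbm module as a stand-in for whatever on-disk key/value store you end up choosing. The class name DiskSet and the file name are made up for illustration, and this says nothing about how a particular dbm backend performs at tens of millions of keys; you would have to measure that.

```python
import dbm
import os
import tempfile


class DiskSet:
    """Set-like wrapper over an on-disk key/value file; members are stored as keys."""

    _DUMMY = b"1"  # placeholder value; only the keys carry information

    def __init__(self, path):
        self._db = dbm.open(path, "c")  # "c": open read/write, create if missing

    def add(self, item: bytes) -> bool:
        """Insert item; return True if it was not already present."""
        if item in self._db:
            return False
        self._db[item] = self._DUMMY
        return True

    def __contains__(self, item: bytes) -> bool:
        return item in self._db

    def close(self):
        self._db.close()


# Alternating insert/lookup, as described in the question's comments.
workdir = tempfile.mkdtemp()
seen = DiskSet(os.path.join(workdir, "seen"))
for value in (b"a", b"b", b"a"):
    if not seen.add(value):
        print(value, "was already stored")
seen.close()
```

The same wrapper shape should carry over to other key/value stores: membership is a key lookup, and insertion writes the dummy value.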

acjay
  • Straightforward solution. But will NoSQL dbs also implement sets and dicts the same way, necessarily? – The Unfun Cat Nov 20 '12 at 07:20
  • Not necessarily, I would say, but typically both data structures will use hash tables. However, a key/value store may not be ready for such a massive number of keys. Most (all?) types of hash tables have severely degraded performance once you exceed their optimum capacity – loopbackbee Nov 20 '12 at 07:26
  • That drawback would presumably apply to sets in these systems as well. I can't really think of a reason that a database would choose a more optimal/scalable design for sets, but not for maps. But of course you're right that no matter what solution OP settles on, they need to make sure it's designed to scale to the set size they want, and using a map to implement a set leaves room for optimization. – acjay Nov 20 '12 at 07:43
  • @goncalopp Any risk of that happening with 70 000 000 unique values stored? – The Unfun Cat Nov 20 '12 at 08:12
  • @acjohnson55 You're right, of course, they're fundamentally the same (though I would expect a typical database to have a higher default optimum capacity on its values than its keys) – loopbackbee Nov 20 '12 at 10:35
  • @TheUnfunCat I haven't really worked with NoSQL, so I can't really tell. If you really want to go the database/key-value-store route, I'd say you should try them to see how they scale. The other obvious solution that you didn't mention is to implement your own data structure: [it has been done before](http://stackoverflow.com/questions/495161/fast-disk-based-hashtables). You should probably ask a new question for this though – loopbackbee Nov 20 '12 at 10:40
0

Why don't you make a collection with the set values used as unique keys?

Update: for example, say you have a document like this:

{
    _id: "someid",
    youset: {val1, val2, val3},
}

You can create a new collection like:

{
    _id: val1,
    owner: "someid"
}
{
    _id: val2,
    owner: "someid"
}
{
    _id: val3,
    owner: "someid"
}
...

Since you don't need all of the data at the same time, there is no need to embed it inside the main document.
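To make the transformation concrete, here is a small sketch in plain Python dicts (no database involved). The function name explode is made up for illustration; the field names are the ones from the example documents above.

```python
def explode(doc, field="youset"):
    """Yield one small document per set member, keyed by the member itself."""
    for value in doc[field]:
        # The member becomes the unique _id; owner points back to the source document.
        yield {"_id": value, "owner": doc["_id"]}


doc = {"_id": "someid", "youset": {"val1", "val2", "val3"}}
for small in sorted(explode(doc), key=lambda d: d["_id"]):
    print(small)
```

Because each value is used as the unique key of its own document, checking whether a value is in the set becomes a lookup by _id.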

ekini