4

For a social network site, I need to store frequently modified lists for each entity(& millions of such entities) which are:

  • frequently appended to
  • frequently read
  • sometimes reduced
  • lists are keyed by primary key

I'm already storing some other type of data in an RDBMS. I know that I could store those lists in an RDBMS as a many to many relationship like this way: Create a table listItems with two columns listId & listItem & to generate any particular list, just do a SELECT query for all records WHERE listId = x. But storing lists this way in an RDBMS is not very ideal when high scalability is concerned. Instead I would like to store prepared lists in a natural way, so that retrieval performance is maximized. Because I need to fetch around hundred of such lists for a user, whenever I user does login & view a page.

So how do I solve this ? What kind of database should be used for this data, probably the one that provide adding variable no of columns to keyed by a primary key, the ones like Cassandra ?

Rajat Gupta
  • 25,853
  • 63
  • 179
  • 294

7 Answers7

5

I used the same method that is, to store a 2 column row for every record, which I turned to a txt file with the formatted html which then we changed to json and finally to mongodb.

But since you have got frequent operations, I suggest cassandra, hbase and googles big table implementations like accumulo cloudata and hypertable.

Cloudata may be the right one for you.

Ajmal M Sali
  • 598
  • 6
  • 14
3

As you pointed out the solution must be performant and scaleable: I'd suggest you to use Redis with it's LIST data structure and O(1) inserts and O(N) fetches (N - elements to fetch, considering you're fetching last ones from lists) and scale it horizontally with some hashing algorithm. I don't know what amount of data you are going to store and how many machines are available, but definitely it will be the best choice performance-wise, since nothing beats memory access speed.

If the amount of data is huge and you can't keep it all in RAM then Cassandra can do the job - storing lists ordered by time is a nice fit for it even better with partition strategy as Zanson mentioned above.

One more thought: you said read performance must be max, and once user logs in you will need to fetch hundred of lists for this user. Why not to prepare a single list for each user? That way there will be more writes but the read will be optimized since you will need to fetch only latest entries from one list. I'm not sure if that fits your task, just a thought. :)

Max Ivanov
  • 5,695
  • 38
  • 52
2

I would recommend SSDB(https://github.com/ideawu/ssdb), a Google leveldb network wrapper. SSDB is designed to store collection data, such as list, map, zset(sorted set). You can use it like this way:

ssdb->hset(listId, listItem1);
ssdb->hset(listId, listItem2);
ssdb->hset(listId, listItem3);
...
list = ssdb->hscan(listId, 100);
// now list = [listItem1, listItem2, listItem3, ...]

The number of items in one map is only limited to the size of hard disk. Another solution is Redis, but Redis stores all data into memory(say no more than 30GB), so it probably won't fit your project.

C++, PHP, Python, Java, Lua, and more clients are supported by SSDB.

ideawu
  • 2,287
  • 1
  • 23
  • 28
  • my lists are dynamically modified over time & I need very good read performance, can SSDB support all that?. – Rajat Gupta Aug 27 '13 at 16:12
  • what are its advantages over popular databases like Cassandra & Redis which can also serve for this data ? – Rajat Gupta Aug 27 '13 at 16:16
  • You can use hdel() to delete items in a list, SSDB has very good read performance, 30000+ reads per second. SSDB has very similar APIs to Redis, but its capacity is 100 times of Redis', up to TBs. Cassandra is a distributed storage system, but SSDB is not. SSDB is fast, light-weighted, easy to install and use. – ideawu Aug 27 '13 at 17:21
2

Cassandra has native support for storing sets/maps/lists. If your queries will always be pulling the whole thing down, then they are a very easy way to deal with this type of thing.

http://www.datastax.com/dev/blog/cql3_collections http://cassandra.apache.org/doc/cql3/CQL.html#collections

If your lists are tied to a user, you can make the different columns on the users row/partition, and then queries for the multiple lists will be fast, as they will all be in the same partition for a given user.

Zanson
  • 3,991
  • 25
  • 31
1

Cassandra can be used very well for such use cases. Create as many Columnfamilies as you want for the returned data sets/queries. Cassandra works best with de-normalized data or sets like 1:m, m:m relations.

Mata
  • 439
  • 2
  • 10
1

I know you didn't want to consider relational databases, but I think for this simple situation, there is also a scalable solution with relational database. The main benefit would be that you don't need to maintain a separate database system.

To gain scalability, all NoSQL solutions will distribute your data across multiple nodes. You can do this in your application code, spreading your data out across multiple relational databases. To keep the load balanced, you may need to move data occasionally, but it may be sufficient to simply spawn a new database for every N lists.

John Tseng
  • 6,262
  • 2
  • 27
  • 35
  • It's not that I don't want to, but actually it's unnecessarily complicated storing such data in RDBMS. my first preference is always mysql. – Rajat Gupta Sep 03 '13 at 05:19
0

In cassandra you can have wide rows, up to 2B columns per row... if that's enough for an entity's cumulative lists' item, you can store whole entity's lists in a single row then retrieve it all together. with cassandra's "composite column" you can store elements of each list sequentially and ordered and you can delete a single column(a list item) when you want, and when you have an insertion you just need to insert a column...

something like this: (!)

      |list_1_Id : item1Id |list_1_Id : item2Id | list_2_Id : item1Id |...| list_n_Id : item3Id |
entity|     item1Value     |     item2Value     |     item1Value      |...|     item3Value      |

so practically you deal with columns(=items) rather than lists... and it makes your work much easier. depends on your lists size cosider using spliting entiti's row to multiple rows... something like this: (!)

                 |  item1Id   |  item2Id   |  item3Id   |  item4Id   |...
entiId_list_1_Id | item1Value | item2Value | item3Value | item4Value |...
                 |  item1Id   |  item2Id   |  item3Id   |  item4Id   |...
entiId_list_2_Id | item1Value | item2Value | item3Value | item4Value |...
...

and you can put itemValue in column name and leave column value empty to reduce size... for example you can insert a new item by simply doing: //columns are sorted by their id if they have any insert into entityList[entityId][listId][itemId] = item value; or //columns are sorted by their value insert into entityList[entityId][listId][itemvalue] = nothing; and delete: delete from entityList where entityId='d' and listId='o' and itemId='n';

or via you application you can do it by using a rich client like Hector...