13

I have a project with two applications (books and reader).

The books application has a table with 4 million rows, with these fields:

    book_title = models.CharField(max_length=40)
    book_description = models.CharField(max_length=400)

To avoid querying a table with 4 million rows, I am thinking of dividing it by subject: 20 models with 20 tables of 200,000 rows each (book_horror, book_dramatic, etc.).

In "reader" application, I am thinking to insert this fields:

    reader_name = models.CharField(max_length=20, blank=True)
    book_subject = models.IntegerField()
    book_id = models.IntegerField()

So instead of a ForeignKey, I am thinking of using an integer book_subject (which picks the appropriate table) and book_id (which finds the book within the table specified by book_subject).
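
For the record, resolving a reader's book would then go something like this (SUBJECT_MODELS and the model names are just illustrative):

    # Hypothetical mapping from the book_subject integer to the model
    # (and therefore the table) that holds books for that subject.
    SUBJECT_MODELS = {
        1: BookHorror,
        2: BookDramatic,
        # ... one entry per subject table
    }

    def get_book(reader):
        model = SUBJECT_MODELS[reader.book_subject]
        return model.objects.get(pk=reader.book_id)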

Is this a good solution to avoid querying a table with 4 million rows?

Is there an alternative solution?

xRobot
  • 4 million isn't a lot; you have a case of premature optimisation. – Tobu Jan 12 '10 at 19:18
  • The table is queried from Ajax using an autocomplete field, with this query in views.py: books.objects.filter(book_title__istartswith=request.GET['q'])[:100] – xRobot Jan 12 '10 at 20:21
  • If you are querying a table on a text field and performance is the problem, you may choose to implement full-text search. However, your queried field is only 40 chars, and I am not sure it poses a big problem for the db. – shanyu Jan 12 '10 at 20:30
  • It's not the question you asked, but you may find this helpful: http://stackoverflow.com/questions/1566717/postgresql-like-query-performance-variations/1566762#1566762 – Tobu Jan 12 '10 at 20:56

7 Answers

19

Like many have said, it's a bit premature to split your table up into smaller tables (horizontal partitioning or even sharding). Databases are made to handle tables of this size, so your performance problem is probably somewhere else.

Indexes are the first step; it sounds like you've done this already, though. Four million rows should be fine for the database to handle with an index.
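
For reference, a minimal sketch of what that looks like (I'm assuming the model is called Book; db_index=True makes Django create a database index on the column):

    class Book(models.Model):
        # Indexed because the autocomplete filters on book_title__istartswith.
        # Note: istartswith is case-insensitive, and on some databases
        # (e.g. PostgreSQL) a functional index may be needed for it to be used.
        book_title = models.CharField(max_length=40, db_index=True)
        book_description = models.CharField(max_length=400)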

Second, check the number of queries you're running. You can do this with something like the Django Debug Toolbar, and you'll often be surprised how many unnecessary queries are being made.
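
You can also count them by hand in a development shell; a quick sketch (Django only records queries when DEBUG = True):

    from django.db import connection, reset_queries

    reset_queries()
    books = list(Book.objects.filter(book_title__istartswith='foo')[:100])
    # Each entry records the SQL that was sent and how long it took.
    for q in connection.queries:
        print(q['time'], q['sql'])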

Caching is the next step: use memcached for pages or parts of pages that are unchanged for most users. This is where you will see your biggest performance boost for the little effort required.
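
A minimal sketch, assuming the memcached backend is configured in settings.py and caching the autocomplete lookup from the comments:

    import hashlib
    from django.core.cache import cache

    def search_books(prefix):
        # memcached keys can't contain spaces, so hash the user input.
        key = 'book_search:' + hashlib.md5(prefix.lower().encode('utf-8')).hexdigest()
        results = cache.get(key)
        if results is None:
            results = list(Book.objects
                               .filter(book_title__istartswith=prefix)
                               .values_list('book_title', flat=True)[:100])
            cache.set(key, results, 60 * 15)  # keep for 15 minutes
        return results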

If you really, really need to split up the tables, the latest version of Django (1.2 alpha) can handle sharding (e.g. multi-db), and you should be able to hand-write a horizontal partitioning solution (Postgres offers an in-db way to do this). Please don't use genre to split the tables! Pick something that you won't ever, ever change and that you'll always know when making a query, like author, divided by the first letter of the surname or something. This is a lot of effort and has a number of drawbacks for a database which isn't particularly big, which is why most people here are advising against it!
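
If you do go down that road, resolving the partition should stay trivial; a rough sketch of the surname-initial idea (the model names are made up):

    # Hypothetical: one table per range of surname initials.
    PARTITIONS = (
        ('abcdefgh', BookAtoH),
        ('ijklmnop', BookItoP),
        ('qrstuvwxyz', BookQtoZ),
    )

    def partition_for(surname):
        initial = surname[0].lower()
        for letters, model in PARTITIONS:
            if initial in letters:
                return model
        return BookOther  # digits, punctuation, non-latin initials, etc.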

[edit]

I left out denormalisation! Put common counts, sums, etc. in, for example, the author table to prevent joins on common queries. The downside is that you have to maintain it yourself (until Django adds a DenormalizedField). I would look at this during development for clear, straightforward cases, or after caching has failed you, but well before sharding or horizontal partitioning.
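
As an example, keeping a book count on a hypothetical Author model up to date by hand (this assumes Book has an author ForeignKey):

    from django.db.models.signals import post_save, post_delete

    class Author(models.Model):
        name = models.CharField(max_length=100)
        # Denormalised: duplicates what COUNT(*) over Book would return,
        # so listing pages don't need an aggregate query per author.
        book_count = models.IntegerField(default=0)

    def update_book_count(sender, instance, **kwargs):
        author = instance.author
        author.book_count = author.book_set.count()
        author.save()

    post_save.connect(update_book_count, sender=Book)
    post_delete.connect(update_book_count, sender=Book)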

Will Hardy
  • OK, if I split the table I will split it by first letter... that is indeed more reasonable :) The table is queried from Ajax using an autocomplete field, with this query in views.py: books.objects.filter(book_title__istartswith=request.GET['q'])[:100] So do you recommend index + memcached? Thanks – xRobot Jan 12 '10 at 20:30
  • Make an index on the first three letters of the title (or however many characters you wait for before querying the database) and it will run quite fast. – naivists Jan 12 '10 at 20:41
13

ForeignKey is implemented as IntegerField in the database, so you save little to nothing at the cost of crippling your model.
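
In other words, the straightforward version costs nothing extra at the database level; a sketch of the question's models with a real ForeignKey:

    class Book(models.Model):
        book_title = models.CharField(max_length=40)
        book_description = models.CharField(max_length=400)

    class Reader(models.Model):
        reader_name = models.CharField(max_length=20, blank=True)
        # Stored as a plain integer column (book_id) in the database anyway,
        # but Django now handles the join and referential integrity for you.
        # (Newer Django versions also require on_delete=models.CASCADE here.)
        book = models.ForeignKey(Book)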

Edit: And for Pete's sake, keep it in one table and use indexes as appropriate.

Ignacio Vazquez-Abrams
  • I am using an index, but the table has 4 million rows and it's queried often. So I don't know if the index is enough :-\ – xRobot Jan 12 '10 at 19:22
  • If it isn't then the database needs more memory. – Ignacio Vazquez-Abrams Jan 12 '10 at 20:16
  • 4 million rows is nothing to sneeze at, but databases are built for this kind of thing, especially if you're indexing. I would only worry about it if you are getting up to at least a hundred million rows. – LeafStorm Jan 12 '10 at 23:53
2

You haven't mentioned which database you're using. Some databases, like MySQL and PostgreSQL, have extremely conservative settings out of the box, which are basically unusable for anything except tiny databases on tiny servers.

If you tell us which database you're using, and what hardware it's running on, and whether that hardware is shared with other applications (is it also serving the web application, for example) then we may be able to give you some specific tuning advice.

For example, with MySQL, you will probably need to tune the InnoDB settings; for PostgreSQL, you'll need to alter shared_buffers and a number of other settings.

Dan Fairs
1

I'm not familiar with Django, but I have a general understanding of databases.

When you have large databases, it's pretty normal to index them. That way, retrieving data should be pretty quick.

When it comes to associating a book with a reader, you should create another table that links readers to books.
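
In Django terms that linking table is a ManyToManyField, or an explicit "through" model if the link needs extra data; a sketch (the Reading model and its fields are just an example):

    class Reader(models.Model):
        reader_name = models.CharField(max_length=20, blank=True)
        # Django creates and manages the reader<->book linking table.
        books = models.ManyToManyField(Book, through='Reading')

    class Reading(models.Model):
        reader = models.ForeignKey(Reader)
        book = models.ForeignKey(Book)
        started = models.DateField(null=True, blank=True)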

It's not a bad idea to divide the books into subjects. But I'm not sure what you mean by having 20 applications.

Steven
  • 20 applications means 20 tables :) I am already using an index, but the table has 4 million rows and it's queried often. So I don't know if the index is enough :-\ – xRobot Jan 12 '10 at 19:21
  • Gods no! Do not split it up into 20 tables! Make an ER diagram of your tables and you will see which extra tables you need to handle this, e.g. extra tables relating people to books, categories to books and so on. – Steven Jan 12 '10 at 20:31
1

Are you having performance problems? If so, you might need to add a few indexes.

One way to get an idea where an index would help is by looking at your db server's query log (instructions here if you're on MySQL).
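
You can also ask the database directly how it plans to execute a given ORM query; a rough sketch (note that str(qs.query) is only an approximation of the final SQL):

    from django.db import connection

    qs = Book.objects.filter(book_title__istartswith='foo')[:100]
    cursor = connection.cursor()
    # A sequential scan in the plan means an index would (or should) help.
    cursor.execute('EXPLAIN ' + str(qs.query))
    for row in cursor.fetchall():
        print(row)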

If you're not having performance problems, then just go with it. Databases are made to handle millions of records, and Django is pretty good at generating sensible queries.

Seth
  • Yes, it's a performance problem. I am using an index, but the table has 4 million rows and it's queried often. So I don't know if the index is enough :-\ – xRobot Jan 12 '10 at 19:19
  • A bigger index (more columns) is probably the way to go, maybe in addition to memcached as @jcm mentions. Oftentimes single-column indexes don't help because they don't get used by your queries. – Seth Jan 12 '10 at 20:42
1

A common approach to this type of problem is sharding. Unfortunately, it's mostly up to the ORM to implement it (Hibernate does it wonderfully), and Django does not support this. However, I'm not sure 4 million rows is really all that bad. Your queries should still be entirely manageable.

Perhaps you should look into caching with something like memcached. Django supports this quite well.

Jim Mitchener
0

You can use a server-side datatable: the filtering, sorting, and paging all happen in the database, and the browser only ever receives one page of rows. With that approach, even a table of more than 4 million records can respond in under a second.
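
The server side of such an endpoint boils down to limit/offset queries; a minimal sketch (the parameter names follow no particular datatable library):

    import json
    from django.http import HttpResponse

    def books_page(request):
        start = int(request.GET.get('start', 0))
        length = int(request.GET.get('length', 25))
        q = request.GET.get('q', '')
        qs = Book.objects.all()
        if q:
            qs = qs.filter(book_title__istartswith=q)
        # Only one page of rows is ever fetched and serialised.
        rows = list(qs.values('id', 'book_title')[start:start + length])
        payload = {'total': qs.count(), 'rows': rows}
        return HttpResponse(json.dumps(payload), content_type='application/json')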

pujan rai