4

Search engines (and similar web services) use flat files and NoSQL databases. The structure of an inverted index is simpler than a many-to-many relationship, yet it should be more efficient to handle it as one. Two tables should be enough for a few billion web pages and millions of keywords. I tested a table of 50 million rows; the speed of MySQL was comparable to that of BerkeleyDB.
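
To make the comparison concrete, here is a minimal sketch of an inverted index stored as a many-to-many relation. It rests on several assumptions that are not in the question: Python's built-in `sqlite3` stands in for MySQL so the example is self-contained, a third junction table (`postings`) links the two tables mentioned above, and all table, column, and function names are hypothetical. A plain dictionary shows the key/value form of the same index for contrast.

```python
import sqlite3
from collections import defaultdict

# Assumption: sqlite3 stands in for MySQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages    (page_id    INTEGER PRIMARY KEY, url  TEXT UNIQUE NOT NULL);
    CREATE TABLE keywords (keyword_id INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL);
    -- Junction table: one row per (keyword, page) pair, i.e. one posting.
    CREATE TABLE postings (
        keyword_id INTEGER NOT NULL REFERENCES keywords(keyword_id),
        page_id    INTEGER NOT NULL REFERENCES pages(page_id),
        PRIMARY KEY (keyword_id, page_id)
    );
""")

# Key/value form of the same index: word -> set of URLs.
kv_index = defaultdict(set)


def index_page(url, words):
    """Record a page and one posting per distinct word (hypothetical helper)."""
    conn.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    page_id = conn.execute(
        "SELECT page_id FROM pages WHERE url = ?", (url,)
    ).fetchone()[0]
    for word in set(words):
        conn.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
        keyword_id = conn.execute(
            "SELECT keyword_id FROM keywords WHERE word = ?", (word,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO postings VALUES (?, ?)", (keyword_id, page_id)
        )
        kv_index[word].add(url)


def lookup(word):
    """Return the URLs of all pages containing `word`, sorted by URL."""
    rows = conn.execute(
        """
        SELECT p.url
        FROM keywords k
        JOIN postings po ON po.keyword_id = k.keyword_id
        JOIN pages p     ON p.page_id = po.page_id
        WHERE k.word = ?
        ORDER BY p.url
        """,
        (word,),
    )
    return [row[0] for row in rows]


index_page("http://example.com/a", ["mysql", "index"])
index_page("http://example.com/b", ["mysql", "nosql"])
print(lookup("mysql"))
print(sorted(kv_index["mysql"]))
```

The relational lookup needs two joins driven by indexes, while the key/value lookup is a single fetch by key. Whether that difference matters at billions of rows is exactly what the question asks; the sketch does not answer it.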

I think the problems of working with a large MySQL database appear with operations like ALTER TABLE (which is not the case here). This workload is read-intensive, which is where MySQL is quite good. When reading a row with SELECT, I did not find a significant difference between a table with a few rows and one with a few million rows; is it different with billions of rows?

NOTE: I do not mean Google or Bing (or advanced features like full-text search); I am discussing the concept.

Googlebot
  • http://stackoverflow.com/questions/2559411/sql-mysql-vs-nosql-couchdb – Jauzsika Oct 16 '11 at 10:39
  • @Jauzsika: Not only here, there are many articles comparing SQL and NoSQL databases; but my question is about a specific application, search engines. Moreover, most comparisons favor SQL databases, and I am asking why NoSQL key/value databases are the winners here. – Googlebot Oct 16 '11 at 10:42
  • You are comparing MySQL specifically: not RDBMS generally – gbn Oct 16 '11 at 10:43
  • @gbn: I am referring to mysql as a famous RDBMS; it can also be Oracle. I just tested mysql as I am familiar with it but not others. – Googlebot Oct 16 '11 at 10:45
  • @Ali There is the answer for your question in the article I linked. Because mysql does a lot of things (without asking you or giving you the option to turn off) search engine dbs do not need. That's why nosql and the likes are much better in performance. – Jauzsika Oct 16 '11 at 10:51
  • @Jauzsika, the answer marked as correct, appends 2 articles, one of which is this one: http://www.yafla.com/dforbes/The_Impact_of_SSDs_on_Database_Performance_and_the_Performance_Paradox_of_Data_Explodification , which kind of puts the argument of speed to a test, and tested it was. – Alex Oct 16 '11 at 11:26
  • @Jauzsika: Definitely, it is better to have only what we need; but the point is that the performance of BerkeleyDB is not better than that of MySQL (two typical examples), unless the case is different for billions of rows. This is my question! – Googlebot Oct 16 '11 at 11:27

1 Answer

2

AFAIK, NoSQL provides flexibility that no regular relational database engine offers. I don't know which search engines use which database engine, but I can think of several benefits of using NoSQL (not flat files; I have no idea why one would use them for complex applications).

Now if you're just matching criteria and returning results in no particular order, you're fine with any relational database. But once you want to provide the most relevant results, there are tons of criteria to take into account. You could:

  • Give priority to results whose content is similar to results the user chose previously.
  • List first the results that are more relevant to the person based on location, language, and other known facts.
  • List more popular results first (again, most popular within a particular region, age group, occupation group, or other group based on known facts about the user).

These are only the basic sorting criteria, the ones that came to mind. Once one starts developing and maintaining the system, hundreds of other criteria will come up as candidates for implementation. Now think about how each one would be implemented. There could be thousands of fields characterizing each resource, and each new feature will need additional data.

You could do that with the EAV (entity-attribute-value) pattern in a relational database, which will give you some flexibility, or you could use NoSQL, which is built exactly for such purposes.
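
For readers unfamiliar with EAV, a minimal sketch follows. It is an illustration under assumptions, not a recommended schema: `sqlite3` is used only to keep the example self-contained, and the table name `resource_attributes`, its columns, and both helper functions are hypothetical. The point it shows is that a new ranking criterion becomes a new row rather than a new column.

```python
import sqlite3

# Assumption: sqlite3 is used only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE resource_attributes (
        resource_id INTEGER NOT NULL,   -- the entity
        attribute   TEXT    NOT NULL,   -- the attribute name
        value       TEXT    NOT NULL,   -- the value, stored as text
        PRIMARY KEY (resource_id, attribute)
    )
""")


def set_attribute(resource_id, attribute, value):
    """Insert or overwrite one attribute of a resource (hypothetical helper)."""
    conn.execute(
        "INSERT OR REPLACE INTO resource_attributes VALUES (?, ?, ?)",
        (resource_id, attribute, str(value)),
    )


def get_attributes(resource_id):
    """Return all attributes of a resource as a dict of strings."""
    rows = conn.execute(
        "SELECT attribute, value FROM resource_attributes WHERE resource_id = ?",
        (resource_id,),
    )
    return dict(rows.fetchall())


set_attribute(1, "language", "en")
set_attribute(1, "region", "EU")
# A criterion invented later needs no ALTER TABLE, only another row.
set_attribute(1, "popularity", 42)
print(get_attributes(1))
```

The flexibility has a cost that the sketch makes visible: every value is stored as text, so types and constraints have to be enforced by the application, and reading one resource with many attributes touches many rows.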

Again, this is just one reason to use NoSQL. I know many more reasons to use an RDBMS.

Alex
  • +1 for a descriptive answer! Consider the simple case (no relevancy of results): which non-relational NoSQL database performs better than MySQL? Moreover, a simple relevancy ranking can be based on the number of keywords (searched terms) associated with an article. – Googlebot Oct 16 '11 at 11:32
  • High-quality benchmarks are a luxury when talking about real-life databases. That's why the search engines are so reluctant to provide relevant results when searching "database engine benchmark" :). All I can say is that enormous reports were working wonderfully on large tables (millions of rows), on a server with 1 GB RAM, 2 single-core Xeon CPUs, and 2 old SCSI drives, while data was constantly pouring in (up to 50 rows per second), on MSSQL 2000. I have no doubt about the good performance of finely tuned RDBMS engines. – Alex Oct 16 '11 at 20:36