
We receive a large number of URLs from customers. The URLs are organized into an arbitrary number of levels, like xxx.com/levelA/levelB/levelC/...levels.../xxxx. We are trying to use this data to build a query system that can answer which URLs are under any given level. For example, getAll("abc.com/test/sub/") should return all recorded URLs that have "abc.com/test/sub/" as a prefix: abc.com/test/sub/a.data, abc.com/test/sub/sub2/data, etc.
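To make the intended getAll semantics concrete, here is a minimal single-process sketch in Python. The class name UrlIndex and its methods are hypothetical, and it is neither distributed nor persistent; it only illustrates that once URLs are kept sorted, every URL sharing a prefix sits in one contiguous run.

```python
import bisect


class UrlIndex:
    """Toy in-memory index (hypothetical): URLs are kept in sorted order."""

    def __init__(self):
        self._urls = []  # sorted list of unique URLs

    def add(self, url):
        # Insert at the sorted position, skipping duplicates.
        i = bisect.bisect_left(self._urls, url)
        if i == len(self._urls) or self._urls[i] != url:
            self._urls.insert(i, url)

    def get_all(self, prefix):
        # Every URL that starts with `prefix` sorts at or after `prefix`,
        # and all such URLs are adjacent, so scan until the first mismatch.
        start = bisect.bisect_left(self._urls, prefix)
        result = []
        for url in self._urls[start:]:
            if not url.startswith(prefix):
                break
            result.append(url)
        return result
```

The same "sorted keys plus range scan" idea is what an ordered index in a database provides, so the sketch is mainly a way to state the requirement precisely.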

This appears to be similar to a file directory structure. My question: is there an existing open source project that can help handle such a scenario? The requirements are:

  • real time system.
  • high write/read throughput.
  • distributed and reliable.
peter
user1462192

1 Answer


Some questions you didn't answer:

  • What counts as high write/read throughput? Anything an RDBMS couldn't handle? What is your expected ratio of reads to writes?
  • Why do you want to have a distributed system? Any particular reason?
  • How long are the strings, on average and at most?

Are you sure a simple MySQL, PostgreSQL, or any other commercial database (Oracle, SQL Server, ...) won't be enough?

Here is a question about MySQL varchar index length. I have run into the same 255-character limitation in SQL Server as well, so I assume other RDBMSs have similar restrictions. Still, nothing is easier than just calling

SELECT url FROM url_list WHERE url like 'abc.com/test/sub/%'
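As a rough illustration, the same prefix lookup can be tried from Python with the standard-library sqlite3 module. This is a sketch under assumptions: SQLite stands in for whatever RDBMS you pick, the helper names open_store and get_all are made up, and the table layout is just the url_list table from the query above. It uses a half-open range instead of LIKE, because % and _ inside a prefix act as wildcards in LIKE and URLs often contain underscores.

```python
import sqlite3


def open_store(path=":memory:"):
    # Hypothetical helper: one table whose primary key doubles as the index.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS url_list (url TEXT PRIMARY KEY)")
    return conn


def get_all(conn, prefix):
    # Range [prefix, prefix + U+10FFFF): matches every URL starting with
    # `prefix`, assuming no URL has U+10FFFF right after the prefix.
    upper = prefix + "\U0010ffff"
    rows = conn.execute(
        "SELECT url FROM url_list WHERE url >= ? AND url < ? ORDER BY url",
        (prefix, upper),
    )
    return [row[0] for row in rows]
```

Whether the range form or LIKE 'prefix%' actually uses the index depends on the database and its collation settings, so it is worth checking the query plan on the system you choose.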

There is also MongoDB, which can be distributed easily and allows regular expressions in queries. Together with an index, you could perform a request similar to the SQL one. You would need to benchmark this specific case yourself to see whether there is a performance difference and which system comes out ahead.
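For the MongoDB variant, the query would be a regular expression anchored at the start of the string. Below is a small Python sketch that only builds the filter document; the function name prefix_filter and the field name url are assumptions. The prefix is escaped so that the dots in a hostname are matched literally rather than as "any character".

```python
import re


def prefix_filter(prefix, field="url"):
    # "^" anchors the match at the start of the string; re.escape keeps
    # characters such as "." from being treated as regex metacharacters.
    return {field: {"$regex": "^" + re.escape(prefix)}}


# With pymongo this would be passed to find(), roughly:
#   collection.find(prefix_filter("abc.com/test/sub/"))
# where `collection` is a hypothetical collection with an index on "url".
```

As far as I know, MongoDB can use an index for a case-sensitive regex that is anchored like this, but that is exactly the part worth benchmarking.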

Otherwise, there are still Couchbase and CouchDB, which offer views. These are basically made for something like this, since they are built via MapReduce. However, views take a few seconds, up to a minute, to be updated, so they are not really a fit if you want to query a URL right after inserting it.

Community
peter