url shortening algorithm in c/c++ - interview

Question

Refer - https://stackoverflow.com/a/742047/161243

Above algo says that we use a DB to store the data. Now if interviewer says that you can't use a DB. Then in that case we can have a stucture:

struct st_short_url{
  char * short_url;
  char * url;
}

Then a hashtable - st_short_url* hashTable[N];

Now we can have an int id which is incremented each time or a random number generated id which is converted to base62.

Problem i see:

-- if this process terminates then i lose track of int id and complete hashTable from RAM. So do i keep writing the hashTable back to disk so that it is persisted? if yes, then a B-tree will be used? Also we need to write id to disk as well?

P.S. Hashtable+writing to disk is Database, but what if i can't use a DBMS? What if i need to come up with my own implementation?

Your thoughts please...

Another Question:

In general, How do we handle infinite redirects in URL shortening?

Writing a hashtable to disk is no different from any other database solution, except that you invented the database system yourself instead of relying on an existing one. — Emil Vikström, Apr 03 '12 at 16:39
It is, but what if i can't use a DBMS? What if i need to come up with my own implementation? — rajya vardhan, Apr 03 '12 at 16:40
I think it'd be a stretch to call, say, a .csv file a database, but it would suffice. — James, Apr 03 '12 at 16:44
Sorry if it sounds silly, but i am thinking about a possible interview scenario. It has been asked in Google and Amazon interviews. — rajya vardhan, Apr 03 '12 at 16:44
What about a lookup in .csv file? Wouldn't it be too slow if file is huge? — rajya vardhan, Apr 03 '12 at 16:46
@rajyavardhan: you could build an index of pointers into the .csv file. But that sounds an awful lot like a database... — Li-aung Yip, Apr 03 '12 at 16:52
If an interviewer says "you can't use a database", say politely that this requirement makes no sense. Everything is a database. A hashtable in main memory is a database. — n. m. could be an AI, Apr 03 '12 at 17:02
A database is just a structured collection of data. It's not at all a stretch to call a .csv file a database. It's not a database which is managed by a relational database management system (RDBMS), but that doesn't mean that it's not a database. — Keith Irwin, Apr 03 '12 at 21:11

score 2 · Answer 1 · answered Apr 03 '12 at 17:54

If you can't use a DB of any kind (i.e. no persistent storage; the file system is nothing but a primitive DB!), then the only way to do it which I see is lossless compression + encoding in allowed characters. The compression algorithm may employ knowledge about URLS (e.g. that it is very likely that they begin with either http:// or https://, quite a few go on with www. and the domain name most often ends in .com, .org or .net. Moreover you can always assume a slash after the host name (because http://example.org and http://example.org/ are equivalent). You also may assume that the URL only contains valid characters, and special-case some substrings which are very likely to occur in the URL (e.g. frequently linked domains, or known naming schemes for certain sites). Probaby the compression scheme should feature a version field so that you can update the algorithm when usage patterns change (e.g. a new web site gets popular and you want to special-case that as well, or a popular site changes its URL pattern which you special-cased) without risking the old links to go invalid.

Such a scheme could also be supported directly in the browser through an extension, saving server bandwidth (the server would still have to be there for those without a browser extension and as fallback if the extension doesn't yet have the newest compression data).

score 2 · Answer 2 · answered Apr 03 '12 at 18:56

2

The requirement isn't practical, but you don't have to give a practical answer. Just use the file system and he won't realize that.

To store:

convert input URL to a string e.g. base64 conversion.
make a file of that name
return the inode number as the short url (e.g. ls -i filename ) or stat() etc.

To retrieve:

get the inode number from user.
find / -inum n -print or some other mechanism.
convert that back to a URL from filename.

answered Apr 03 '12 at 18:56

pizza

7,296
1
25
22

Slow as hell; every access requires a linear search. The filename should be the hash. The file contents (or better, symlink contents) should be the url. – R.. GitHub STOP HELPING ICE Apr 04 '12 at 01:42
1

It is an interview question, it doesn't need to be fast, it just need to pass the interview. – pizza Apr 04 '12 at 03:50
I would highly consider rejecting a candidate who offered such an algorithmically bad solution... An interview problem is not just to see if you can solve the problem, but to see what sort of mistakes you make in the process. And mixing up the key and value in such a way that you have to linear-search for the entry you want on every lookup is a pretty bad mistake. – R.. GitHub STOP HELPING ICE Apr 04 '12 at 03:54
An interview problem also let the interviewee see what the interviewer is, may be someone has a high opinion of this interviewer than I do. – pizza Apr 04 '12 at 04:17

score 1 · Answer 3 · answered Apr 03 '12 at 17:23

A database is a data structure that supports insertion, removal and search of items. As has been pointed out in the comments to the OP, nearly everything is a database, so this constraint seems somewhat uninformed.

If you're not allowed to use an existing DBMS, you can resort to storing items on disk, making use of tmpnam() or a similar technique that doesn't suffer from race conditions. tmpnam() yields unique IDs, and you can use the associated file to store information.

url shortening algorithm in c/c++ - interview

3 Answers3