I am writing code for our server that has to determine users' access type by URL.
At the beginning we saw about 100 million distinct URLs accessed per day; over time this has grown to nearly 600 million distinct URLs per day.
For 100 million URLs, we did the following:
1) Built a HashMap over parallel arrays, whose keys are one part of the URL (represented as a long) and whose values are the other part (represented as an int); a key can have multiple values.
2) Searched the HashMap to find how many times a URL was accessed.
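For reference, a minimal sketch of what such a parallel-array table might look like. The class name and layout here are our own illustration, not the actual production code; it assumes a single int count per long key, a power-of-two capacity, and that 0 is never a valid key:

```java
// Open-addressing hash table over parallel primitive arrays: long key -> int count.
// Assumptions (ours, not from the original post): key 0 marks an empty slot,
// capacity is a power of two, and the table is never filled completely.
public class LongIntTable {
    private final long[] keys;
    private final int[] counts;
    private final int mask;

    public LongIntTable(int capacityPow2) {
        keys = new long[capacityPow2];     // all zero = all empty
        counts = new int[capacityPow2];
        mask = capacityPow2 - 1;
    }

    // Find the slot holding `key`, or the first empty slot on its probe path.
    private int slot(long key) {
        int h = (int) (key ^ (key >>> 32)) & mask;
        while (keys[h] != 0 && keys[h] != key) {
            h = (h + 1) & mask;            // linear probing
        }
        return h;
    }

    public void increment(long key) {
        int h = slot(key);
        keys[h] = key;                     // no-op if the key was already present
        counts[h]++;
    }

    public int count(long key) {
        int h = slot(key);
        return keys[h] == key ? counts[h] : 0;
    }
}
```

Lookups are a linear probe over a `long[]` with no per-entry object overhead, which is why a custom table like this tends to beat a generic `HashMap<Long, Integer>` at this scale.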
As the HashTable grew larger, we did the following:
1) Built two or three separate HashTables, and load and store them (on the ordinary file system) to find how many times a URL was accessed.
The issue is:
1) Although the HashTable's lookup performance is quite good, the code takes considerable time loading and storing the table: using a FileChannel, it takes 16-19 seconds to load or store a 200-million-entry HashTable (load factor 0.5).
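One thing that usually helps with FileChannel load/store times is making the pass purely sequential through one reused direct buffer, instead of many small reads or writes. A sketch of that idea follows; the class name and file layout (12 bytes per entry, key then count, parallel arrays of equal length) are our assumptions for illustration:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sequential bulk store/load of parallel long[]/int[] arrays through one
// reused direct buffer. Layout assumption: 12 bytes per entry (long key,
// int count), keys.length == counts.length.
public final class TableIO {
    private static final int CHUNK = 1 << 20;   // entries per buffer flush (~12 MB)

    public static void store(Path file, long[] keys, int[] counts) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.allocateDirect(CHUNK * 12)
                                       .order(ByteOrder.nativeOrder());
            for (int i = 0; i < keys.length; i += CHUNK) {
                int n = Math.min(CHUNK, keys.length - i);
                buf.clear();
                for (int j = 0; j < n; j++) {
                    buf.putLong(keys[i + j]).putInt(counts[i + j]);
                }
                buf.flip();
                while (buf.hasRemaining()) ch.write(buf);   // one sequential stream
            }
        }
    }

    public static void load(Path file, long[] keys, int[] counts) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocateDirect(CHUNK * 12)
                                       .order(ByteOrder.nativeOrder());
            for (int i = 0; i < keys.length; i += CHUNK) {
                int n = Math.min(CHUNK, keys.length - i);
                buf.clear().limit(n * 12);
                while (buf.hasRemaining() && ch.read(buf) >= 0) { }
                buf.flip();
                for (int j = 0; j < n; j++) {
                    keys[i + j] = buf.getLong();
                    counts[i + j] = buf.getInt();
                }
            }
        }
    }
}
```

Note that 200 million 12-byte entries is roughly 2.4 GB, so even a perfectly sequential pass is bounded by disk bandwidth; 16-19 seconds may simply be what the drive can deliver, in which case the real gain comes from not rereading the whole table at all.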
What we would like to ask is:
1) Any comments on how to solve this issue?
2) How can we reduce the load/store time? (I asked before, but FileChannel seems to be the best option.)
3) Would storing a HashTable larger than memory and caching it repeatedly be a good solution? If so, how could we do that (at least some pointers)? We tried it using:
RandomAccessFile raf = new RandomAccessFile("array.dat", "rw");
IntBuffer map = raf.getChannel()
                   .map(FileChannel.MapMode.READ_WRITE, 0, 1 << 30)
                   .order(ByteOrder.nativeOrder())
                   .asIntBuffer();
However, this gives worse performance than the previous approach.
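If we were to retry the mapped approach, the variant below probes the file-backed table in place rather than copying it through an IntBuffer view first. This is only a sketch under our own assumptions (12-byte slots, power-of-two slot count, key 0 means empty); also note that a single Java mapping is limited to 2 GB of int-indexed addressing, so a full-size table would need several mappings:

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// File-backed open-addressed table probed directly through a memory mapping.
// Slot layout assumption (ours): 8-byte key then 4-byte count, key 0 = empty.
// A single MappedByteBuffer addresses at most 2 GB, so very large tables
// would need to be split across several mappings.
public final class MappedTable {
    private static final int SLOT = 12;
    private final MappedByteBuffer map;
    private final int mask;

    public MappedTable(Path file, int slotsPow2) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            map = ch.map(FileChannel.MapMode.READ_WRITE, 0, (long) slotsPow2 * SLOT);
            map.order(ByteOrder.nativeOrder());
        }
        mask = slotsPow2 - 1;
    }

    public void increment(long key) {
        int h = (int) (key ^ (key >>> 32)) & mask;
        while (true) {
            long k = map.getLong(h * SLOT);
            if (k == key || k == 0) {
                map.putLong(h * SLOT, key);
                map.putInt(h * SLOT + 8, map.getInt(h * SLOT + 8) + 1);
                return;
            }
            h = (h + 1) & mask;            // linear probing, as in the in-memory table
        }
    }

    public int count(long key) {
        int h = (int) (key ^ (key >>> 32)) & mask;
        while (true) {
            long k = map.getLong(h * SLOT);
            if (k == key) return map.getInt(h * SLOT + 8);
            if (k == 0) return 0;          // hit an empty slot: key absent
            h = (h + 1) & mask;
        }
    }
}
```

The first pass over a cold mapping is still slow, because every probe that touches a new page triggers a page fault that reads it from disk; the potential win is that already-touched pages stay in the OS page cache, so there is no up-front 16-19 second full load.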
Thanks.
NB:
1) Per earlier Stack Overflow suggestions, we tried a NoSQL DB such as TokyoCabinet, but in our experience a custom HashTable gives better performance than it on 100 million key-value pairs.
2) Pre-reading the data to warm the disk cache is not possible, because our application starts working as soon as the system starts, and the system is restarted the next day.
What we forgot to mention is:
1) Our application is part of a project to be deployed on a small campus, so we assume no more than 800 million URLs are accessed; you can consider the data volume fixed at roughly 600-700 million.
2) Our main concern is performance.
3) We have to run our application locally.