3

I am trying to write code for our server, in which I have to find users' access type by URL.

At the beginning, we saw about 100 million distinct URLs accessed per day. Over time, this has grown to nearly 600 million distinct URLs per day.

For 100 million, what we did is the following:

1) Build a HashMap using parallel arrays, whose keys are one part of the URL (represented as a long) and whose values are the other part of the URL (represented as an int) - a key can have multiple values.

2) Then search the HashMap to find how many times a URL was accessed.

Now, as the HashTable becomes larger, what we do is the following:

1) Build two/three separate HashTables, and load and store them (on the general file system) to find how many times a URL was accessed.

Now, the issue is:

1) Though the HashTable performance is quite nice, the code takes a lot of time loading/storing the HashTable (we are using FileChannel; it takes 16-19 seconds to load/store a HashTable of 200 million entries, as the load factor is 0.5).

What we are trying to ask is:

1) Any comments on how to solve this issue?

2) How can we reduce the load/store time? (I asked before, but it seems FileChannel is the best way.)

3) Would storing a large HashTable (larger than memory) and caching it repeatedly be a nice solution? If so, how can we do that (at least some pointers)? We tried it by using

RandomAccessFile raf = new RandomAccessFile("array.dat", "rw");
IntBuffer map = raf.getChannel()
        .map(FileChannel.MapMode.READ_WRITE, 0, 1 << 30)
        .order(ByteOrder.nativeOrder()).asIntBuffer();

However, it gives worse performance than the previous approach.

Thanks.

NB:

1) As per previous suggestions on Stack Overflow, we used a NoSQL DB like TokyoCabinet, but from our experience, a custom HashTable gives better performance than it on 100 million key-value pairs.

2) Pre-reading data for disk caching is not possible, because our application starts working as soon as the system starts, and the same holds the next day when the system starts again.

What we forgot to mention is:

1) As our application is part of a project to be applied on a small campus, we assume that no more than 800 million URLs are accessed. So, you can consider the data volume of 600/700 million values to be fixed.

2) Our main concern is performance.

3) We have to run our application locally.

Edit: the code of our HashMap can be found here.

Arpssss
  • @Hans, Tokyo/Kyoto cabinet. Too slow. – Arpssss Jul 03 '12 at 14:00
  • Can there be tons of values per key? Sounds like a hash table that holds lists of ints – Kevin DiTraglia Jul 03 '12 at 14:01
  • 2
    Try something like Coherence or Terracotta. Writing something on your own is unlikely to turn out well. – duffymo Jul 03 '12 at 14:01
  • @KDiTraglia, sorry I should mention. Not much. Maximum 10-15. And not for all keys. – Arpssss Jul 03 '12 at 14:02
  • so it's many to one mapping but not one to many, i.e. a -> v, b -> v, but not a -> v1, a -> v2? – Hans Z Jul 03 '12 at 14:02
  • @Hans, Look, I have to find a and corresponding v's. – Arpssss Jul 03 '12 at 14:06
  • 1
    This is not an object-oriented solution. Just tossing stuff into a map doesn't seem like a good idea, even if you can do it. Voting to close. – duffymo Jul 03 '12 at 14:16
  • 2
    What's the fill ratio of your hash map? Is your usual mode of operation 1. load map, 2. insert keys for one day, 3. save map? Otherwise give details. What exactly do you mean by second url part as value? Do you attempt to count number of distinct URLs, or number of accesses per URL, or both? Where does the conversion from URL to long/int take place, and how? Could it be that most `put` calls are for a small number of keys, resulting in bad performance due to linear time collision handling? – MvG Jul 11 '12 at 14:31
  • @MvG +1 - The most basic question is still unanswered: *What* is stored in the map? What are the keys and what are the values? What is the mapping's semantics, what does "key 'A' is mapped to value 'B'" mean in your application? – JimmyB Jul 16 '12 at 17:53

12 Answers

6

It might be best to access the table as a memory-mapped buffer. That way, you could simply implement random access to the file, without worrying about loading and storing, and leave caching to the operating system. I see that your current implementation already does use memory-mapped access for reading and writing, but it still loads things into the java heap in between. Avoid this data duplication and copying! Treat the backing file itself as the data structure, and only access the portions of it that you actually need, only when you need them.

Within that file, hash maps will work if you are really, really sure that hash collisions are not an issue. Otherwise I'd go for a B+ tree there, with nodes about the size of your hard disk pages. That way, each disk access will yield a lot more usable data than just a single key, resulting in a shallower tree and fewer individual disk operations.

I guess others will have implemented stuff like this, but if you prefer your own hash map implementation, you might prefer to write your own memory-mapped B+ trees as well.
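As a rough illustration of treating the backing file itself as the data structure (a minimal sketch, not code from this answer: the class name, the fixed 12-byte record layout, and the one-value-per-key simplification are all assumptions), a long-to-int open-addressing table probed directly through a MappedByteBuffer could look like this:

import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: an open-addressing hash table stored directly in a memory-mapped file.
// Assumed record layout: 8-byte long key + 4-byte int value; key == 0 marks an empty slot.
// A single mapping is limited to 2 GB, so a full-size table would need several mapped
// regions (much like the separate HashTables the question already uses).
public class MappedLongIntTable {
    private static final int RECORD = 12;
    private final long capacity;                 // number of slots
    private final MappedByteBuffer buf;

    public MappedLongIntTable(String file, long capacity) throws Exception {
        this.capacity = capacity;
        FileChannel ch = new RandomAccessFile(file, "rw").getChannel();
        // Map the whole table once; the OS pages the hot parts in and out for us.
        this.buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, capacity * RECORD);
        this.buf.order(ByteOrder.nativeOrder());
    }

    public void put(long key, int value) {
        long slot = (key & Long.MAX_VALUE) % capacity;       // simple modulo hash
        while (true) {
            int pos = (int) (slot * RECORD);
            long k = buf.getLong(pos);
            if (k == 0 || k == key) {                        // empty slot or same key
                buf.putLong(pos, key);
                buf.putInt(pos + 8, value);
                return;
            }
            slot = (slot + 1) % capacity;                    // linear probing
        }
    }

    public int get(long key) {
        long slot = (key & Long.MAX_VALUE) % capacity;
        while (true) {
            int pos = (int) (slot * RECORD);
            long k = buf.getLong(pos);
            if (k == key) return buf.getInt(pos + 8);
            if (k == 0) return -1;                           // not found
            slot = (slot + 1) % capacity;
        }
    }
}

Only the pages that put/get actually touch are read from or written to disk, so there is no separate load or store phase at all.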

MvG
3

The whole approach sounds ridiculous to me. I gather what you really want to achieve is a simple access counter per distinct URL. By its very nature, this data is frequently written but rarely ever read.

For this purpose, I would simply have a database table and add a new entry for every access (it can serve as a log as well). When you need to figure out how often any URL was accessed, this can easily be done using a SELECT COUNT from the table (depending on how much additional data you store along with the URL entries, you can even do constrained counts, like how often it was accessed yesterday, last week, etc.).

This puts all the work off to the point where the result is really needed.

BTW, you may be able to retrieve the access counts from the web server's log files as well, so maybe you don't need to write any data yourself. Look into this first.
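To make the idea concrete (a minimal sketch only; the embedded H2 database, the access table, and its columns are assumptions for the example, not part of this answer):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch: log every access as one row, count on demand with SELECT COUNT.
public class AccessLog {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC URL; any database with a JDBC driver works.
        Connection con = DriverManager.getConnection("jdbc:h2:./accesslog");

        con.createStatement().execute(
            "CREATE TABLE IF NOT EXISTS access (url VARCHAR(2048), ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP)");

        // Record one access (inserts would be batched for millions of rows per day).
        PreparedStatement ins = con.prepareStatement("INSERT INTO access(url) VALUES (?)");
        ins.setString(1, "http://example.com/page");
        ins.executeUpdate();

        // How often was this URL accessed?
        PreparedStatement cnt = con.prepareStatement("SELECT COUNT(*) FROM access WHERE url = ?");
        cnt.setString(1, "http://example.com/page");
        ResultSet rs = cnt.executeQuery();
        rs.next();
        System.out.println("accesses: " + rs.getLong(1));

        con.close();
    }
}

An index on the url column would be needed for the count query to stay fast at this volume.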

Durandal
  • Thanks for your answer. Look, the URLs accessed are given to me as a simple file per day, according to the need. I have no way to change that. So, from that simple file of 600 million URLs, I have to implement a fast look-up table. I don't think an SQL DB will be faster for lookups. – Arpssss Jul 10 '12 at 13:18
  • The question does not specify what you were given to process; to me the wording implied that the context is running inside some server, which in turn led me to assume that you are collecting the data on the fly. Turns out that's not the case :) – Durandal Jul 10 '12 at 14:12
  • 1
    So, when you say "distinct URL's", do you *really* mean DISTINCT as in there are 700 million *KEYS* in your hashmap, or did you mean in fact that your file contains 700M logged accesses? – Durandal Jul 10 '12 at 14:16
  • 1
    This is very different from what you asked in the question - for the Hashmap the number of *keys* should be the deciding performance factor. This leads me to believe that your real problem is IO performance and the Hashmap has *nothing* to do with it. I recommend you investigate where your real problem lies (Profile application, do some rough testing while eliminating IO etc) and then update your question. I think it unlikely you get any useful answers with the current state of the question. – Durandal Jul 10 '12 at 14:21
  • HashMap performance is OK, but I/O performance is worse; most of the time is spent on I/O. I have talked with Peter (http://stackoverflow.com/questions/11317664/java-swapping-page/11317746#11317746 - last section) about this and am waiting for a solution. – Arpssss Jul 10 '12 at 14:34
1

You can use a caching framework like JCS. 1 billion key-value pairs should not be a problem.

http://commons.apache.org/jcs/

Sree
  • All the disk-memory swapping the framework will manage. I've not tried it with 1 billion records. – Sree Jul 03 '12 at 14:08
  • Is it applicable for keys with multiple values? I could not find that anywhere. Another point: I have my own custom HashMap implementation, created using two parallel arrays. Can I use the above-mentioned JCS for that? Note, I have to store and load the HashMap to/from memory also for future use. For more, http://stackoverflow.com/questions/11398762/custom-hashmap-code-issue – Arpssss Jul 09 '12 at 17:28
0

Definitely try Redis; I think it beats anything else hands down.

sfk
  • Redis supports key-multiple values ? – Arpssss Jul 03 '12 at 14:07
  • yah. key can have multiple values. – Arpssss Jul 03 '12 at 14:09
  • 1
    on the home page "keys can contain strings, hashes, lists, sets and sorted sets." – sfk Jul 03 '12 at 14:10
  • @Arpssss, the classic definition of a key-value storage mechanism (a map) states that a key may only have one value. I am not even sure how key -> multiple values may even be possible, since it would violate the necessary one-to-one property of this mechanism. – jn1kk Jul 03 '12 at 15:06
  • hmm, to quote "In order to achieve its outstanding performance, Redis works with an in-memory dataset" . This is a dataset that does not fit in memory. – maniek Jul 16 '12 at 19:03
0

You can use Berkeley DB, which is basically a key/value store written in C for ultimate performance. It's an Oracle product (open source, though), so I would take it seriously.

0

If your application has to run locally without using any external computing power, there is no solution that can be more performant than direct memory access: the only data structure that can give you better performance than a HashMap is an array, where access to every element is O(1). However, this requires knowing in advance how many items you have, having a unique addressing index per element, and also being able to reserve significant adjacent memory.

After arrays, which as described are suitable only for limited cases, you have HashTables; however, as the size of the data grows, the cost of collisions and dynamic resizing increases and makes the performance poor.

You can refer to the java.util.HashMap javadoc and also to Wikipedia (http://en.wikipedia.org/wiki/Hash_table) to understand the following:

  • How expensive is it to compute the hash?
  • How well are the values distributed?
  • What load factor are you using, i.e. what cost will you have for conflict resolution?
  • How often will you need to resize your HashMap before it contains all the data?

If your performance degrades while building your HashMap (which I actually believe is a ConcurrentHashMap, since if you build it in parallel it has to be thread safe), you might want to investigate why that happens.

A simple but easy first step would be to replace your HashMap with a TreeMap, whose performance is a deterministic function of its size, and compare the two.
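A naive version of that comparison could look like this (a rough timing sketch, not a proper benchmark; the key range and count are made up for illustration):

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Naive comparison of insert + lookup time for HashMap vs TreeMap.
public class MapCompare {
    static long run(Map<Long, Integer> map, int n) {
        long start = System.nanoTime();
        for (long i = 0; i < n; i++) map.put(i * 2654435761L, (int) i);   // spread the keys
        for (long i = 0; i < n; i++) map.get(i * 2654435761L);
        return (System.nanoTime() - start) / 1_000_000;                   // elapsed ms
    }

    public static void main(String[] args) {
        int n = 1_000_000;   // small enough to fit comfortably in the heap
        System.out.println("HashMap: " + run(new HashMap<>(), n) + " ms");
        System.out.println("TreeMap: " + run(new TreeMap<>(), n) + " ms");
    }
}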


If, on the other hand, I misinterpreted your question and you have the opportunity to scale the computation across multiple machines, there are plenty of interesting solutions on the market, as someone has already pointed out, to which I would add Cassandra.

These solutions achieve performance improvements by distributing the load among multiple nodes, but inside each node they use well-known algorithms for fast and efficient addressing.

Edmondo
0

It's not clear from the question and follow-up discussion, but what's the nature of your queries? You've got very different situations between
a) working through all ~700 million URLs during each working day, or
b) hitting some small number of those ~700 million URLs.

So: what's the ratio of # of queries to the # of URLs?

From your descriptions, it sounds like you may be loading/unloading the different files representing different portions of your array... which suggests random queries, which suggests (b).

As well, I gather you've already recognized that "all-in-memory" isn't feasible (i.e. you've broken the array across multiple files), so an optimal disk-access algorithm seems to be the next order of business, no?

Have you tried, per query, a simple seek (n * arrayElementSize) to an offset in the file and just reading a few pages into memory (do you have/know a maximum # of values per key)? You've already got (computed) the base index into your array, so this should be easy to prototype.
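For what it's worth, a minimal sketch of that per-query seek (the file name and the fixed 12-byte record size are assumptions for illustration):

import java.io.RandomAccessFile;

// Sketch: per-query random access into the on-disk array.
// Assumes fixed-size records of one long key + one int value (12 bytes).
public class SeekLookup {
    private static final int RECORD_SIZE = 12;

    // Read the value stored at array index n without loading the whole file.
    static int readValueAt(RandomAccessFile raf, long n) throws Exception {
        raf.seek(n * RECORD_SIZE + 8);   // skip the 8-byte key, land on the int value
        return raf.readInt();
    }

    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("array.dat", "r");
        long index = 42;                 // base index already computed from the key's hash
        System.out.println("value at " + index + ": " + readValueAt(raf, index));
        raf.close();
    }
}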

Richard Sitze
0

I would suggest you use Oracle Coherence Cache. You get all the benefits of a HashTable; it has all the methods that Map has.

Performance-wise, you can store data as per your requirements. Please have a look.

amicngh
0

You can try HugeCollections, I think it was written for this purpose

HugeCollections
Library to support collections with millions or billions of entries.

specifically HugeMap

epoch
0

Use the open-source SQLite in-memory database.

Kamahire
0

If I understand you correctly, your data structure is not that big

(32 + 64) bits * 600 million entries = 57.6 billion bits, i.e. roughly a 6.7 GiB structure in memory

The map data structure would consume some space too. I've found out the hard way that Trove is one of the most memory-efficient data structures around. I'd use a TLongIntHashMap to store long keys and integer values. It stores raw primitives, so you bypass the Long and Integer wrapper objects.
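For reference, a minimal sketch of using Trove's TLongIntHashMap as an access counter (Trove 3 package names assumed; the question's multi-value keys would need a different structure, e.g. a TLongObjectHashMap of int arrays):

import gnu.trove.map.hash.TLongIntHashMap;

// Sketch: primitive long -> int map, no Long/Integer boxing overhead.
public class TroveCounter {
    public static void main(String[] args) {
        TLongIntHashMap counts = new TLongIntHashMap();   // would be pre-sized for real data

        long urlKey = 123456789L;                 // first URL part, already encoded as a long
        counts.adjustOrPutValue(urlKey, 1, 1);    // increment by 1, or insert 1 if absent
        counts.adjustOrPutValue(urlKey, 1, 1);

        System.out.println(counts.get(urlKey));   // prints 2
    }
}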

qwerty
0

It seems you have a mostly read-only dataset that does not fit in memory, and you need fast key lookups. I am afraid there is no silver-bullet solution here, except for a few possible trade-offs.

If you access the 600M records all over the place, no matter what you do, you are going to be limited by disk random access speed (not sequential access speed). Use FileChannel.map to directly access the file (no, don't read the contents of the file into memory, just operate on the MappedByteBuffer; your OS will take care of caching for you). Investing in an SSD looks to be a good way to spend money (or maybe just buy some more memory?).

This is a campus environment, right? Maybe you can use computers in a lab to make a memcached/redis/etc. cluster? Maybe you could use it off-hours?

If you access some identifiable pieces of data at the same time (i.e. now we analyze domain a, then b, etc.), then splitting the data into buckets is a good idea. Keep the related data physically close, to help caching. Or maybe pre-sort the URLs and access them in binary-search fashion?

If some probability of collisions is acceptable, maybe storing only 64-bit hashes of the URLs as keys, instead of the full URLs, is acceptable? With some gymnastics you could probably get away with not storing the keys at all?
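One way to get such a 64-bit key (FNV-1a over the URL's characters, purely as an illustration; any decent 64-bit hash would do):

// Sketch: collapse a URL string to a 64-bit key with FNV-1a.
// With ~600-800 million distinct URLs, the chance of at least one collision is small but not zero.
public class UrlHash {
    static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;          // FNV-1a 64-bit offset basis
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);                  // hashing chars here for brevity; bytes would be more standard
            hash *= 0x100000001b3L;               // FNV-1a 64-bit prime
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(fnv1a64("http://example.com/some/page"));
    }
}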

Those are my ideas for the moment.

maniek