In some previous posts I asked questions about coding a custom hash map/table in Java. Since I have not been able to solve it, and I may not have described clearly what I really want, I am summarizing everything here to make it clear and precise.

What I am going to do:

I am writing code for our server in which I have to find a user's access type by URL.

Now, I have approximately 1,110 million URLs.

So, what we did:

1) Divided the database into 10 parts, each holding 110 million URLs.

2) Built a hash map using parallel arrays, whose keys are one part of the URL (represented as a long) and whose values are the other part of the URL (represented as an int). One key can have multiple values (a rough sketch of this structure follows the list).

3) Then, each day when the system starts, search that hash map for another set of URLs (the millions of URLs saved over the day).
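To make the structure concrete, here is a minimal sketch of such a parallel-array table with open addressing. The class and method names are my own illustration (not the actual code); resizing is omitted and key 0 is reserved as the empty-slot marker:

```java
// Hypothetical sketch (not the actual code) of an open-addressing table backed
// by two parallel primitive arrays: 64-bit keys, 32-bit values. Duplicate keys
// are allowed, so one key can carry 2-3 values. Key 0 is reserved as the
// empty-slot marker and resizing is omitted to keep the sketch short.
import java.util.ArrayList;
import java.util.List;

class LongIntMultiMap {
    private final long[] keys;
    private final int[] values;
    private final int mask;   // capacity must be a power of two

    LongIntMultiMap(int capacityPowerOfTwo) {
        keys = new long[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    private int indexFor(long key) {
        long h = key * 0x9E3779B97F4A7C15L;   // cheap 64-bit mix
        return (int) (h ^ (h >>> 32)) & mask;
    }

    void put(long key, int value) {
        int i = indexFor(key);
        while (keys[i] != 0) {                // linear probing to the next free slot
            i = (i + 1) & mask;
        }
        keys[i] = key;
        values[i] = value;
    }

    List<Integer> get(long key) {
        List<Integer> result = new ArrayList<>(3);
        int i = indexFor(key);
        while (keys[i] != 0) {                // scan the probe chain until an empty slot
            if (keys[i] == key) {
                result.add(values[i]);
            }
            i = (i + 1) & mask;
        }
        return result;
    }
}
```

Searching walks the probe chain from the key's home slot and collects every matching slot, which is how one key can carry 2-3 values without a separate list per key.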

What I have tried:

1) I have tried many NoSQL databases; however, we found they were not a good fit for our purpose.

2) I have built a custom hash map (using two parallel arrays) for this purpose.

So, what the issue is:

When the system starts, we have to load the hash table for each database and then search it for millions of URLs.

Now, the issue is:

1) Although the hash table's search performance is quite good, the code takes far longer to load the hash table. We load it with a FileChannel and a memory-mapped buffer (sketched below), which takes about 20 seconds per table (220 million entries at a load factor of 0.5; we found that to be the fastest).

So, the time we spend is: (hash table load + hash table search) × number of databases = (20 + 5) × 10 = 250 seconds. That is quite expensive for us, and most of the time (200 of the 250 seconds) goes to loading the hash tables.
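For reference, here is a hedged sketch of the kind of FileChannel/MappedByteBuffer bulk load being timed here; the file layout (all keys first, then all values) and the names are assumptions, not the actual code:

```java
// Hypothetical sketch of bulk-loading the two parallel arrays from disk via
// FileChannel + MappedByteBuffer. Assumed file layout: all keys first (8 bytes
// each), then all values (4 bytes each). Each mapped region must stay under
// 2 GB; larger tables would need to be mapped in several chunks.
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class TableLoader {
    static void load(Path file, long[] keys, int[] values) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long keyBytes = (long) keys.length * Long.BYTES;
            long valueBytes = (long) values.length * Integer.BYTES;

            // Map the key region and copy it into the heap array in one bulk get.
            MappedByteBuffer keyRegion = ch.map(FileChannel.MapMode.READ_ONLY, 0, keyBytes);
            keyRegion.asLongBuffer().get(keys);

            // Map the value region that follows the keys and copy it the same way.
            MappedByteBuffer valueRegion = ch.map(FileChannel.MapMode.READ_ONLY, keyBytes, valueBytes);
            valueRegion.asIntBuffer().get(values);
        }
    }
}
```

Copying each mapped region into the heap arrays with a single bulk get() keeps the I/O sequential, but the gigabytes of table data per database still have to come off disk every time, which is where the 20 seconds go.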

What other ways I have considered:

One way could be:

Skip explicit loading and storing entirely and leave caching to the operating system by using a memory-mapped buffer. But since I have to search for millions of keys, this gives worse performance than the approach above.
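For contrast, here is a rough sketch of that alternative, where the table is probed directly in the mapped file and nothing is copied up front (again, names and layout are my own assumptions):

```java
// Hypothetical sketch of searching the table directly in the mapped file,
// leaving caching to the OS. Same assumed layout as above (keys, then values);
// each region must stay under the 2 GB per-mapping limit.
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedTable {
    private final MappedByteBuffer keyRegion;   // slots of 8-byte keys
    private final MappedByteBuffer valueRegion; // slots of 4-byte values
    private final int slots;                    // number of slots, a power of two

    MappedTable(Path file, int slots) throws IOException {
        this.slots = slots;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long keyBytes = (long) slots * Long.BYTES;
            keyRegion = ch.map(FileChannel.MapMode.READ_ONLY, 0, keyBytes);
            valueRegion = ch.map(FileChannel.MapMode.READ_ONLY, keyBytes, (long) slots * Integer.BYTES);
        }
    }

    /** Returns the first value stored for the key, or -1 if the key is absent. */
    int getFirst(long key) {
        long h = key * 0x9E3779B97F4A7C15L;       // same mix as the in-memory table
        int i = (int) (h ^ (h >>> 32)) & (slots - 1);
        while (true) {
            long k = keyRegion.getLong(i * 8);    // random access: page fault if uncached
            if (k == 0) return -1;                // empty slot ends the probe chain
            if (k == key) return valueRegion.getInt(i * 4);
            i = (i + 1) & (slots - 1);
        }
    }
}
```

Each probe is a random access into the mapping, so millions of lookups against a cold page cache turn into millions of page faults, which matches the worse performance observed.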

Since the hash table's search performance is good but its loading time is high, we thought about cutting the load time another way:

1) Create an array of linked lists of size Integer.MAX_VALUE (using my own custom linked list).

2) Insert each value (an int) into the linked list whose index is the key number (we reduce the key size to an int).

3) That way, we only have to store the linked lists to disk.

Now, the issue is that creating that many linked lists takes a lot of time, and creating such a large number of linked lists is pointless if the data is not well distributed. (A rough sketch of this structure is shown below.)
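For clarity, here is a rough sketch of the structure described in the three steps above, with my own illustrative names; a slot array anywhere near Integer.MAX_VALUE (plus hundreds of millions of nodes) is exactly what makes construction slow and memory-hungry:

```java
// Hypothetical sketch of the array-of-linked-lists idea: the 64-bit key is
// folded down to a 32-bit slot index, and each slot holds a small singly
// linked chain of int values.
import java.util.function.IntConsumer;

class IntListTable {
    // One immutable node per stored value, chained per slot.
    static final class Node {
        final int value;
        final Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    private final Node[] slots;

    IntListTable(int size) {
        slots = new Node[size];   // a size near Integer.MAX_VALUE is impractical
    }

    void put(long key, int value) {
        int i = reduce(key);
        slots[i] = new Node(value, slots[i]);   // prepend to this slot's chain
    }

    /** Visits every value stored under the key's reduced index. */
    void forEach(long key, IntConsumer consumer) {
        for (Node n = slots[reduce(key)]; n != null; n = n.next) {
            consumer.accept(n.value);
        }
    }

    private int reduce(long key) {
        // Fold the 64-bit key into a non-negative index; collisions are accepted.
        return (int) ((key ^ (key >>> 32)) & 0x7FFFFFFFL) % slots.length;
    }
}
```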

What my requirements are:

Simply put, my requirements are:

1) Insertion and search for keys with multiple values, with good search performance.

2) A fast way to load the table into memory (this is the main concern).

(Keys are 64-bit ints and values are 32-bit ints; one key can have at most 2-3 values. We could also make the keys 32-bit, which would cause more collisions, but that is acceptable for us if it makes things better overall.)

Can anyone help me solve this, or offer any comments on how to approach the issue?

Thanks.

NB:

1) As per earlier suggestions on Stack Overflow, pre-reading the data to warm the disk cache is not possible, because our application starts working as soon as the system starts, and the system is started fresh again the next day.

2) We have not found NoSQL databases to scale well for this, since our requirements are simple (just insert hash table key-value pairs, then load and search / retrieve the values).

3) Our application is part of a small project to be deployed on a small campus, so I don't think anybody will buy me an SSD for it. That is my limitation.

4) We also tried Guava and Trove, but they were not able to store such a large amount of data even in 16 GB (we are using a 32 GB Ubuntu server).

  • Have you taken a look at HazelCast? Maybe it will work in this situation. http://www.hazelcast.com/ – Marvo Aug 01 '12 at 18:49
  • @Marvo, sorry, I should mention that I have to run it locally, not in a distributed environment. This is because, truly, I am not the system designer and I have to do what I have been told to do. – Arpssss Aug 01 '12 at 18:52
  • How much time is the linked-list approach actually taking? Just based on programming contest rules of thumb, I would be...kind of shocked if any Java implementation could possibly be significantly faster than a few minutes. – Louis Wasserman Aug 01 '12 at 18:53
  • @LouisWasserman, Sometimes it fails to create such a large array. Otherwise, about 1 million per second. We are really worried about whether the load performance will be good or not. – Arpssss Aug 01 '12 at 18:54
  • 1
    Personally, I'm pretty certain that no significantly better Java implementation exists. The only other techniques I can think of have unacceptably increased memory costs. ...Okay. I can think of one further change to make, but that one actually just reduces memory consumption and decreases speed a little, so I don't think you want it. – Louis Wasserman Aug 01 '12 at 18:59
  • @LouisWasserman, Look, I can provide at most 20 GB of memory (which is quite a lot). But I want the overall time reduced properly. Does what you are thinking of solve that? – Arpssss Aug 01 '12 at 19:05
  • I don't believe it's possible. Not with Java, and not with the constraints as you've described them. – Louis Wasserman Aug 01 '12 at 19:07
  • 1
    @LouisWasserman, I have edited my new hashmap code (with linked list). Can you kindly take a look onto it and inform me is it possible not to create such large amount of linked lists or not (means on the fly). And is it possible to store it in such a way thus it should be faster to load. Note that, there is no requirements to load keys and we can store approx 500 millions of key-value pairs with that on 20 GB. That is what I am thinking. – Arpssss Aug 01 '12 at 19:16
  • This would be better moved to codereview.stackexchange.com or, better yet, a [gist](https://gist.github.com/) so I can actually edit code instead of trying to describe the changes in English. – Louis Wasserman Aug 01 '12 at 19:21
  • @LouisWasserman, Here it is: https://gist.github.com/3229931 – Arpssss Aug 01 '12 at 19:25
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/14763/discussion-between-louis-wasserman-and-arpssss) – Louis Wasserman Aug 01 '12 at 19:32
  • @LouisWasserman, I want another piece of help from you, i.e. on http://stackoverflow.com/questions/11777513/java-array-of-linked-lists. Is it possible to create such a structure? That would also serve the BitSet purpose, and reduce the space requirements a lot. – Arpssss Aug 02 '12 at 14:00

3 Answers


If you need quick access to 1110 million data items then hashing is the way to go. But don't reinvent the wheel, use something like:

AW101
  • Actually...disagree. The OP is going to _need_ a custom-built data structure if he expects to do this in RAM. – Louis Wasserman Aug 01 '12 at 19:24
  • Sounds like the OP already has a hash worked out for keying, so all that's needed is to serialise the custom data structure for the multi-values and store against the key. – AW101 Aug 01 '12 at 19:31
  • Eh? He can't afford that, either. He's saying that his entirely customized data structure -- which isn't really even a hash anymore; it's just a straight-up lookup table -- isn't fast enough. – Louis Wasserman Aug 01 '12 at 19:36

It seems to me (if I understand your problem correctly) that you are approaching the problem in a convoluted manner.

The data you are trying to pre-load is huge to begin with (say 220 million * 64 ~ 14 GB), and you are trying to memory-map it, and so on.

I think this is a typical problem that is solved by distributing the load across different machines. That is, instead of trying to locate a linked-list index, you should figure out which machine holds the part of the map that contains a specific key, and fetch the value from that machine (each machine has loaded part of this database map, and you get the data from the appropriate part of the map, i.e. machine, each time).

Maybe I am way off here, but I also suspect you are using a 32-bit machine.

So if you have to stay with a single-machine architecture and it is not economically possible to improve your hardware (a 64-bit machine and more RAM, or an SSD as you point out), I don't think you can make any dramatic improvement.
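As a purely illustrative aside (the question states that a distributed setup is not an option), the routing this answer describes could be as simple as hashing each key to a partition/machine index:

```java
// Illustrative only: route each 64-bit key to one of N partitions/machines,
// so every lookup touches a single, smaller in-memory map.
class Partitioner {
    static int partitionFor(long key, int numPartitions) {
        long h = key * 0x9E3779B97F4A7C15L;                 // mix the bits first
        return (int) Math.floorMod(h ^ (h >>> 32), (long) numPartitions);
    }
}
```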

Cratylus

I don't really understand in what form you are storing the data on disk. If what you are storing consists of URLs and some numbers, you might be able to speed up loading from disk quite a bit by compressing the data (unless you are already doing that).

Creating a multithreaded loader that decompresses while loading might be able to give you quite a big boost.
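Here is a hedged sketch of what such a loader might look like, assuming the table is stored as several gzip-compressed chunk files of (long key, int value) pairs; the file layout, names, and sizes are my own assumptions:

```java
// Hedged sketch: store the table as several gzip-compressed chunk files of
// (long key, int value) pairs and decompress them in parallel while loading.
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.zip.GZIPInputStream;

class ParallelLoader {
    /** Each chunk file holds a contiguous range of slotsPerChunk (key, value) pairs. */
    static void loadAll(List<Path> chunks, long[] keys, int[] values, int slotsPerChunk)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (int c = 0; c < chunks.size(); c++) {
            final int chunkIndex = c;
            final Path file = chunks.get(c);
            pool.submit(() -> {
                int base = chunkIndex * slotsPerChunk;   // first slot this chunk fills
                try (DataInputStream in = new DataInputStream(
                        new GZIPInputStream(Files.newInputStream(file), 1 << 16))) {
                    for (int i = 0; i < slotsPerChunk; i++) {
                        keys[base + i] = in.readLong();  // decompress while reading
                        values[base + i] = in.readInt();
                    }
                } catch (IOException e) {
                    throw new RuntimeException("failed to load " + file, e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

Whether this helps depends on whether the load is disk-bound or CPU-bound: compression reduces the bytes read from disk, and the thread pool keeps decompression off the critical path.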

Leo