
I want to implement a method to merge two huge files (each row of the files contains a JSON object) on a common value.

The first file is like this:

{
  "Age": "34",
  "EmailHash": "2dfa19bf5dc5826c1fe54c2c049a1ff1",
  "Id": 3,
  ...
}

and the second:

{
  "LastActivityDate": "2012-10-14T12:17:48.077",
  "ParentId": 34,
  "OwnerUserId": 3
}

I have implemented a method that reads the first file, takes the first JsonObject and its Id, and if the second file contains a row with the same Id (OwnerUserId == Id), it appends that second JsonObject to the first file; otherwise it writes the row to another file that contains only the rows that don't match. This way, if the first JsonObject has 10 matches, the second row of the first file never has to scan those rows again.

The method works fine, but it is too slow. I have already tried loading the data into MongoDB and querying the DB, but that is slow too. Is there another way to process the two files?

ilamaiolo
  • How did you implement your method? How many parent and child records are you dealing with? – Eric J. Apr 17 '14 at 23:48
  • possible duplicate of [How do I perform the SQL Join equivalent in MongoDB?](http://stackoverflow.com/questions/2350495/how-do-i-perform-the-sql-join-equivalent-in-mongodb) – Eric J. Apr 17 '14 at 23:49
  • The first file contains 60,000 rows and the second file 550,000. My method is hard to explain, but at a high level I always use two files: one to read the data from and a second to hold the rows that don't match (so at the second iteration the next row of the first file is compared against the latest temp file I created). – ilamaiolo Apr 17 '14 at 23:57
  • Why create temporary files? If you convert the JSON objects to Java objects, most systems have plenty of RAM to read the first and second "files" (could actually be results directly from MongoDB, in memory) into Dictionary data structures, and perform the join in RAM. Should be much faster than working with temp files. – Eric J. Apr 18 '14 at 00:19
  • Because I have already worked that way, and it processed maybe 1 row per second (which is still far too slow for my work); with the temp file I guess it is even slower. – ilamaiolo Apr 18 '14 at 00:45
  • If 1 row is taking 1 sec in memory, something is very wrong. Suggest you post the relevant code. – Eric J. Apr 18 '14 at 00:54
  • 1 row of the first file has to be compared with all 600,000 rows of the second file (for this reason, when I find a row that matches I write another tmp file, so the second row only has to read the rows that didn't match the first row of the first file)... – ilamaiolo Apr 18 '14 at 01:05
  • That comparison is extremely fast if you use a Dictionary to hold the "second file" rows, in memory. If you keep looping through a file, it will be extremely slow. – Eric J. Apr 18 '14 at 01:06 (see the sketch after these comments)
  • The problem is if I try to load all the info of the second Json file my laptop throws a heap space exception! – ilamaiolo Apr 18 '14 at 01:10
  • Like I said... convert it to native Java objects first (the JSON representation is just too big). Sorry, this conversation is getting to be too long to be useful for the Stack Overflow format so I'm going to bow out. – Eric J. Apr 18 '14 at 01:26
  • How exactly are you "merging" here? Is the intention to add the second file's contents as array members in the resulting document? Your question could do with a sample of what you expect as a result. – Neil Lunn Apr 18 '14 at 06:38
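
As a rough illustration of the in-memory join Eric J. suggests in the comments: assuming each file holds one JSON object per line and using Gson (the file names users.json, posts.json and merged.json are placeholders), the whole merge can be done in two linear passes with a HashMap index:

    import com.google.gson.Gson;
    import com.google.gson.JsonObject;

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class JsonHashJoin {

        public static void main(String[] args) throws IOException {
            Gson gson = new Gson();

            // Pass 1: index the second (larger) file by OwnerUserId.
            Map<Integer, List<JsonObject>> byOwner = new HashMap<>();
            try (BufferedReader in = Files.newBufferedReader(Paths.get("posts.json"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    JsonObject post = gson.fromJson(line, JsonObject.class);
                    int owner = post.get("OwnerUserId").getAsInt();
                    byOwner.computeIfAbsent(owner, k -> new ArrayList<>()).add(post);
                }
            }

            // Pass 2: stream the first file and look up matches in O(1) per row.
            try (BufferedReader in = Files.newBufferedReader(Paths.get("users.json"));
                 BufferedWriter out = Files.newBufferedWriter(Paths.get("merged.json"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    JsonObject user = gson.fromJson(line, JsonObject.class);
                    int id = user.get("Id").getAsInt();
                    for (JsonObject post : byOwner.getOrDefault(id, Collections.emptyList())) {
                        // Merge the two rows into a fresh object and write it out.
                        JsonObject merged = new JsonObject();
                        user.entrySet().forEach(e -> merged.add(e.getKey(), e.getValue()));
                        post.entrySet().forEach(e -> merged.add(e.getKey(), e.getValue()));
                        out.write(gson.toJson(merged));
                        out.newLine();
                    }
                }
            }
        }
    }

If even the indexed JsonObject rows don't fit on the heap, keeping only the fields you actually need in a small POJO, or the partitioned variant sketched in the first answer below, brings the footprint down further.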

2 Answers


What you're doing simply must be damn slow. If you don't have enough memory for all the JSON objects, then try storing the data as plain Java objects; that way you surely need much less.

And there's a simple approach that needs even less memory and only n passes, where n is the ratio of required memory to available memory.

On the ith pass consider only objects with id % n == i and ignore all the others. This way the memory consumption drops by roughly a factor of n, assuming the ids are nicely distributed modulo n.

If this assumption doesn't hold, use f(id) % n instead, where f is some hash function (feel free to ask if you need it).
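
A minimal sketch of those passes, again assuming one JSON object per line and Gson as in the question, with the file names and n = 4 as placeholder choices:

    import com.google.gson.Gson;
    import com.google.gson.JsonObject;

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class PartitionedJoin {

        public static void main(String[] args) throws IOException {
            int n = 4; // number of passes; choose it so one partition of the first file fits in memory
            Gson gson = new Gson();

            try (BufferedWriter out = Files.newBufferedWriter(Paths.get("merged.json"))) {
                for (int i = 0; i < n; i++) {
                    // Index only the rows of the first file whose Id falls into partition i.
                    Map<Integer, JsonObject> users = new HashMap<>();
                    try (BufferedReader in = Files.newBufferedReader(Paths.get("users.json"))) {
                        String line;
                        while ((line = in.readLine()) != null) {
                            JsonObject user = gson.fromJson(line, JsonObject.class);
                            int id = user.get("Id").getAsInt();
                            if (Math.floorMod(id, n) == i) {
                                users.put(id, user);
                            }
                        }
                    }
                    // Scan the second file; only rows in partition i can possibly match.
                    try (BufferedReader in = Files.newBufferedReader(Paths.get("posts.json"))) {
                        String line;
                        while ((line = in.readLine()) != null) {
                            JsonObject post = gson.fromJson(line, JsonObject.class);
                            int owner = post.get("OwnerUserId").getAsInt();
                            JsonObject user = users.get(owner);
                            if (Math.floorMod(owner, n) != i || user == null) {
                                continue;
                            }
                            JsonObject merged = new JsonObject();
                            user.entrySet().forEach(e -> merged.add(e.getKey(), e.getValue()));
                            post.entrySet().forEach(e -> merged.add(e.getKey(), e.getValue()));
                            out.write(gson.toJson(merged));
                            out.newLine();
                        }
                    }
                }
            }
        }
    }

With n = 1 this is just a single-pass in-memory hash join; larger n trades extra re-reads of the files for a smaller heap.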

maaartinus

I solved it using a temporary DB. I created an index on the key I want to merge on, and this way I can query the DB and the response is very fast.
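
For what it's worth, a minimal sketch of that setup, assuming the temporary DB is MongoDB (which the question already mentions), the second file has been imported into a collection named posts in a database named merge_tmp (e.g. with mongoimport), and the current MongoDB Java driver is on the classpath; all of those names are placeholders:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class IndexedLookup {

        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> posts =
                        client.getDatabase("merge_tmp").getCollection("posts");

                // One-time setup: index the join key so each lookup walks a B-tree
                // instead of scanning the whole collection.
                posts.createIndex(new Document("OwnerUserId", 1));

                // For each Id read from the first file, the matching rows come back via the index.
                int id = 3; // example value taken from the question
                for (Document post : posts.find(new Document("OwnerUserId", id))) {
                    System.out.println(post.toJson());
                }
            }
        }
    }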

ilamaiolo