
I'm looking for the best solution for a problem that I have (-:

I have k CSV files (5 CSV files, for example). Each file has m fields that form a key and n values. I need to produce a single CSV file with the aggregated data. For example:

file 1: f1,f2,f3,v1,v2,v3,v4
        a1,b1,c1,50,60,70,80
        a3,b2,c4,60,60,80,90 

file 2: f1,f2,f3,v1,v2,v3,v4
        a1,b1,c1,30,50,90,40
        a3,b2,c4,30,70,50,90

result: f1,f2,f3,v1,v2,v3,v4  
        a1,b1,c1,80,110,160,120
        a3,b2,c4,90,130,130,180

Algorithms we have thought of so far:

  1. hashing (using a ConcurrentHashMap; see the sketch below)

  2. merge-sorting the files

  3. DB: using MySQL or Hadoop.

The solution needs to handle a huge amount of data (each file has more than two million rows).
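
To make option 1 concrete, here is a minimal Java sketch, assuming the f1,f2,f3 / v1..v4 layout above, that fields contain no embedded commas, and that the set of distinct keys fits in memory; the class and file names are illustrative:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class CsvHashAggregator {

        // key columns re-joined with ',' -> running sums of the value columns
        private final Map<String, long[]> sums = new ConcurrentHashMap<>();
        private final int keyCols;
        private final int valueCols;

        public CsvHashAggregator(int keyCols, int valueCols) {
            this.keyCols = keyCols;
            this.valueCols = valueCols;
        }

        // Read one CSV file (with a header line) and fold its rows into the shared map.
        // Safe to call from several threads, one file per thread.
        public void addFile(Path csv) throws IOException {
            try (BufferedReader r = Files.newBufferedReader(csv)) {
                r.readLine();                              // skip header
                String line;
                while ((line = r.readLine()) != null) {
                    String[] cols = line.split(",");
                    String key = String.join(",", Arrays.copyOfRange(cols, 0, keyCols));
                    long[] acc = sums.computeIfAbsent(key, k -> new long[valueCols]);
                    synchronized (acc) {                   // per-key lock for concurrent readers
                        for (int i = 0; i < valueCols; i++) {
                            acc[i] += Long.parseLong(cols[keyCols + i]);
                        }
                    }
                }
            }
        }

        // Write the merged result with the given header line.
        public void write(Path out, String header) throws IOException {
            try (BufferedWriter w = Files.newBufferedWriter(out)) {
                w.write(header);
                w.newLine();
                for (Map.Entry<String, long[]> e : sums.entrySet()) {
                    StringBuilder row = new StringBuilder(e.getKey());
                    for (long v : e.getValue()) {
                        row.append(',').append(v);
                    }
                    w.write(row.toString());
                    w.newLine();
                }
            }
        }

        public static void main(String[] args) throws IOException {
            CsvHashAggregator agg = new CsvHashAggregator(3, 4); // f1,f2,f3 key + v1..v4
            for (String file : args) {
                agg.addFile(Paths.get(file));
            }
            agg.write(Paths.get("merged.csv"), "f1,f2,f3,v1,v2,v3,v4");
        }
    }

A ConcurrentHashMap only pays off if several files are read in parallel (one thread per file); for a single thread a plain HashMap does the same job.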

A better example: file 1

country,city,peopleNum
england,london,1000000
england,coventry,500000

file 2:

country,city,peopleNum
england,london,500000
england,coventry,500000
england,manchester,500000

merged file:

country,city,peopleNum
england,london,1500000
england,coventry,1000000
england,manchester,500000

The key is country,city, of course. This is just an example; my real key has 6 columns and there are 8 data columns, for a total of 14 columns.

royB
  • There are tools that might do that. I played a little with "kettle" and it might be interesting. – SJuan76 Aug 04 '13 at 07:41
  • Please update your question to include how you're uniquely identifying a row (how do you know when two rows can be "merged"), and how you're merging such a row (apparently adding some of the values). I believe I see how you're doing it, but you don't want me or anyone making false assumptions. – Richard Sitze Aug 04 '13 at 16:02
  • Does my response to this [seemingly unrelated question](http://stackoverflow.com/questions/18048669/caching-strategy-for-large-immutable-objects-in-scala/18050407#18050407) suggest a workable hash mechanism? – Richard Sitze Aug 05 '13 at 11:22
  • @RichardSitze Thanks for your response. Your answer is more of an implementation detail, and I'm not looking for one; I'm just trying to figure out what my best approach is regarding runtime, memory consumption, and data integrity. – royB Aug 06 '13 at 10:31

1 Answer


I think the answer really depends on your situation:

1) If you need a ready-made solution, then Splunk might be the way to go (http://splunk-base.splunk.com/answers/6783/handling-large-amount-of-csv-files-as-input-and-rename-sourcetype-as-well-as-specify-header).

2) If you have the infrastructure / bandwidth / development time for Hadoop, then go create a solution with it.

3) If this is a one-time job, create a merge-sort solution (I have processed 2 TB files in bash using sed / awk / sort); a rough sketch of the merge step is shown after this list.

4) A custom solution if you don't like any of the above.
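
For (3), a rough Java sketch of the merge step, assuming the inputs are already sorted by their key columns (e.g. with sort under LC_ALL=C, so the ordering matches String.compareTo for ASCII keys), follow the country,city,peopleNum layout from the question, and contain no quoted commas; the class and constants are illustrative, and only one row per file is held in memory:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.PriorityQueue;

    // K-way merge of CSV files that are already sorted by their key columns.
    // Rows with equal keys are folded together by summing the value columns,
    // so memory stays O(number of files) regardless of file size.
    public class SortedCsvMerger {

        private static final int KEY_COLS = 2;   // country,city
        private static final int VALUE_COLS = 1; // peopleNum

        private static class Cursor {
            final BufferedReader reader;
            String[] row;                        // current row, split on ','
            Cursor(BufferedReader reader) { this.reader = reader; }
            boolean advance() throws IOException {
                String line = reader.readLine();
                row = (line == null) ? null : line.split(",");
                return row != null;
            }
            String key() { return String.join(",", Arrays.copyOfRange(row, 0, KEY_COLS)); }
        }

        public static void merge(List<Path> sortedInputs, Path out) throws IOException {
            PriorityQueue<Cursor> heap =
                    new PriorityQueue<>((a, b) -> a.key().compareTo(b.key()));
            List<BufferedReader> readers = new ArrayList<>();
            try (BufferedWriter w = Files.newBufferedWriter(out)) {
                for (Path p : sortedInputs) {
                    BufferedReader r = Files.newBufferedReader(p);
                    readers.add(r);
                    r.readLine();                // skip header
                    Cursor c = new Cursor(r);
                    if (c.advance()) heap.add(c);
                }
                w.write("country,city,peopleNum");
                w.newLine();
                while (!heap.isEmpty()) {
                    String key = heap.peek().key();
                    long[] sums = new long[VALUE_COLS];
                    // Pull every cursor whose current key matches and fold it in.
                    while (!heap.isEmpty() && heap.peek().key().equals(key)) {
                        Cursor c = heap.poll();
                        for (int i = 0; i < VALUE_COLS; i++) {
                            sums[i] += Long.parseLong(c.row[KEY_COLS + i]);
                        }
                        if (c.advance()) heap.add(c);
                    }
                    StringBuilder row = new StringBuilder(key);
                    for (long v : sums) row.append(',').append(v);
                    w.write(row.toString());
                    w.newLine();
                }
            } finally {
                for (BufferedReader r : readers) r.close();
            }
        }

        public static void main(String[] args) throws IOException {
            List<Path> inputs = new ArrayList<>();
            for (String a : args) inputs.add(Paths.get(a));
            merge(inputs, Paths.get("merged.csv"));
        }
    }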

Mehul Rathod
  • I'm more interested in a custom solution. Also, why do you think a Hadoop solution is better? Why sorting and not hashing? What kind of custom solution? – royB Aug 04 '13 at 08:11
  • Hadoop is better because it can scale well as you add more data, and the sorting and merging are taken care of in the reduce phase automatically. The custom solution will really depend on what kind of machine you are running on; if the data does not fit in RAM you will have to do a disk-based sort, which slows things down, and as the data grows the application will not scale well. – Mehul Rathod Aug 04 '13 at 09:56
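
To illustrate the last comment, a minimal MapReduce sketch, assuming the org.apache.hadoop.mapreduce API and the country,city,peopleNum layout from the question; the class names and header handling are illustrative. The shuffle groups rows by key, so the reducer only has to sum:

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Mapper emits (key columns, value columns); the framework sorts and groups
    // by key, and the reducer sums the value columns for each group.
    public class CsvAggregateJob {

        private static final int KEY_COLS = 2;   // country,city
        private static final int VALUE_COLS = 1; // peopleNum

        public static class KeyMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] cols = line.toString().split(",");
                if (cols.length < KEY_COLS + VALUE_COLS) return;
                try {
                    Long.parseLong(cols[KEY_COLS]);          // skip header / malformed rows
                } catch (NumberFormatException e) {
                    return;
                }
                String key = String.join(",", Arrays.copyOfRange(cols, 0, KEY_COLS));
                String vals = String.join(",",
                        Arrays.copyOfRange(cols, KEY_COLS, KEY_COLS + VALUE_COLS));
                ctx.write(new Text(key), new Text(vals));
            }
        }

        public static class SumReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> rows, Context ctx)
                    throws IOException, InterruptedException {
                long[] sums = new long[VALUE_COLS];
                for (Text row : rows) {
                    String[] v = row.toString().split(",");
                    for (int i = 0; i < VALUE_COLS; i++) sums[i] += Long.parseLong(v[i]);
                }
                StringBuilder out = new StringBuilder();
                for (int i = 0; i < VALUE_COLS; i++) {
                    if (i > 0) out.append(',');
                    out.append(sums[i]);
                }
                // TextOutputFormat writes "key<TAB>value" by default; adjust the
                // separator if strict CSV output is required.
                ctx.write(key, new Text(out.toString()));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "csv aggregate");
            job.setJarByClass(CsvAggregateJob.class);
            job.setMapperClass(KeyMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }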