I have about 2 million records, each with four string fields that need to be checked for duplicates. Specifically, the fields are name, phone, address and fathername, and I must dedupe each record against the rest of the data using all of these fields (a minimal sketch of one record is shown below). The resulting unique records need to be written to the db.
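For reference, each record looks roughly like this; the class name `Record` is just illustrative, since in the datastore these are plain string properties on an entity:

```java
// Illustrative only: in the datastore each record is an entity with these
// four string properties; "Record" is a placeholder class name.
public class Record {
    public String name;
    public String phone;
    public String address;
    public String fathername;
}
```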
I have implemented MapReduce and can iterate over all records. The task rate is set to 100/s and the bucket size to 100, with billing enabled.
Currently everything works, but performance is very slow: in 6 hours I have been able to dedupe only 1,000 records out of a test dataset of 10,000.
The current design in Java is (a simplified sketch of the map step follows this list):
- In every map iteration, I compare the current record with the previous record
- The previous record is a single record in the db that acts like a global variable; I overwrite it with another previous record in each map iteration
- The comparison is done with a matching algorithm, and the result is written to the db as a new entity
- At the end of one MapReduce job, I programmatically create another job
- The previous-record variable helps the job compare the next candidate record against the rest of the data
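To make this concrete, here is a simplified sketch of the map step described above. The entity kinds ("PreviousRecord", "DedupeResult"), the result property names, and the exact-match comparison are placeholders for illustration; my real matching algorithm is more involved, and the MapReduce wiring (mapper class, job configuration) is omitted.

```java
import java.util.Objects;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.KeyFactory;

public class DedupeMapStep {

    private final DatastoreService datastore =
            DatastoreServiceFactory.getDatastoreService();

    // Invoked once per record during the map phase.
    public void map(Entity current) throws EntityNotFoundException {
        // "Previous record": a single well-known entity used as a global variable.
        Entity previous =
                datastore.get(KeyFactory.createKey("PreviousRecord", "singleton"));

        // Compare the current record against the previous one on all four fields.
        boolean duplicate = isDuplicate(previous, current);

        // Write the comparison result to the db as a new entity.
        Entity result = new Entity("DedupeResult");
        result.setProperty("candidateKey", current.getKey());
        result.setProperty("duplicate", duplicate);
        datastore.put(result);

        // Overwrite the "previous record" so the next map iteration sees this one.
        Entity next = new Entity("PreviousRecord", "singleton");
        for (String field : new String[] {"name", "phone", "address", "fathername"}) {
            next.setProperty(field, current.getProperty(field));
        }
        datastore.put(next);
    }

    // Placeholder for the real matching algorithm: naive exact match on all fields.
    private boolean isDuplicate(Entity a, Entity b) {
        return Objects.equals(a.getProperty("name"), b.getProperty("name"))
                && Objects.equals(a.getProperty("phone"), b.getProperty("phone"))
                && Objects.equals(a.getProperty("address"), b.getProperty("address"))
                && Objects.equals(a.getProperty("fathername"), b.getProperty("fathername"));
    }
}
```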
I am willing to increase GAE resources as much as needed to get this done in the shortest possible time.
My questions are:
- Will the accuracy of the dedupe (duplicate checking) be affected by parallel jobs/tasks?
- How can this design be improved?
- Will this scale to 20 million records?
- What is the fastest way to read/write variables (not just counters) during a map iteration, so that they can be shared across one MapReduce job?
Freelancers are most welcome to assist with this.
Thanks for your help.