
In my previous question I asked for advice on an algorithm to compare all elements in a huge list: Scala: Compare all elements in a huge list

A more general problem I am facing, and would be grateful for advice on, is how to do an approximate comparison of list elements when the list does not fit into memory all at once. I am building the list from an SQL query that returns a cursor over a single string field of about 70 000 000 records. I need to find the edit distance (http://en.wikipedia.org/wiki/Edit_distance) between every two string elements in this list.
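For reference, here is a minimal two-row dynamic-programming sketch of the edit distance in Scala. This is just the textbook Levenshtein algorithm as an illustration, not code from the linked question:

```scala
// Standard Levenshtein distance with a two-row DP table, so memory stays
// proportional to one string length rather than to the full matrix.
def editDistance(a: String, b: String): Int = {
  val prev = Array.tabulate(b.length + 1)(j => j) // distances for the previous row
  val curr = new Array[Int](b.length + 1)         // distances for the current row
  for (i <- 1 to a.length) {
    curr(0) = i
    for (j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
    }
    Array.copy(curr, 0, prev, 0, curr.length)
  }
  prev(b.length)
}
```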

My idea here is to use a sliding window of N records to compare all 70 000 000 records (a rough sketch of this loop follows the numbered steps below):

  1. Read N elements into a list that fits nicely into memory (N ~ 10 000)
  2. Calculate the edit distance between all elements of this list using the algorithm described here: Scala: Compare all elements in a huge list
  3. Read the next N elements (records N to 2N-1) into a new list. Compare all of these as in step 2.
  4. Rewind the SQL cursor to the first record
  5. Compare every string from records 0 to N-1 with all strings in this new list, using the same algorithm as in step 2.
  6. Slide the window again and read strings from records 2N to 3N-1 into a new list
  7. Compare every string from records 0 to 2N-1 with all strings in this new list, using the same algorithm as in step 2.
  8. And so on.
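Here is a rough sketch of this driver loop in Scala, just to make the steps concrete. Everything in it is a placeholder: `fetchBatch` stands in for re-running the query with LIMIT/OFFSET (or rewinding the cursor), `editDistance` is the pairwise metric, and `store` is the DB write; none of these are existing functions.

```scala
object SlidingWindowSketch {

  // Hypothetical helper: fetch up to n strings starting at `offset`,
  // e.g. by re-issuing the query with LIMIT n OFFSET offset.
  def fetchBatch(offset: Long, n: Int): IndexedSeq[String] = ???

  // The pairwise metric, e.g. the two-row Levenshtein shown earlier.
  def editDistance(a: String, b: String): Int = ???

  // Placeholder for writing one (String, String, Distance) record to the DB.
  def store(a: String, b: String, d: Int): Unit = ()

  def run(total: Long, n: Int): Unit = {
    var windowStart = 0L
    while (windowStart < total) {
      val window = fetchBatch(windowStart, n)

      // Step 2: all pairs inside the current window.
      for (i <- window.indices; j <- i + 1 until window.size)
        store(window(i), window(j), editDistance(window(i), window(j)))

      // Steps 4-7: every string from the earlier windows against the current one.
      var prevStart = 0L
      while (prevStart < windowStart) {
        val earlier = fetchBatch(prevStart, n)
        for (a <- earlier; b <- window)
          store(a, b, editDistance(a, b))
        prevStart += n
      }

      windowStart += n
    }
  }
}
```

Since `window` and `earlier` are local vals, each batch becomes unreachable as soon as the next one replaces it.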

I need to write all comparison results into the DB as (String, String, Distance) records, where the first two elements are the strings being compared and the third is the resulting distance.
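One possible way to model a result row before it is written out (the name is illustrative only; the actual DB write, e.g. a JDBC batch insert, is not shown):

```scala
// Hypothetical shape of one comparison result row.
case class ComparisonResult(left: String, right: String, distance: Int)
```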

Questions:

  1. How can I make Scala garbage-collect the lists from previous steps of this algorithm that are no longer needed?
  2. This algorithm is awful in terms of the number of calculations required to do the job. Are there other algorithms or ideas for reducing the complexity?

Thanks!

  • As written, there's nothing much you can do. If you need all pair-wise comparisons, then you need to do 2.4 _billion_ comparisons and end up with a list of 2.4 billion elements. I find it unlikely you're going to do anything else with that many elements, so what happens next? What are you going to do with the results of this comparison? – The Archetypal Paul Jul 06 '14 at 19:48
  • Are all the strings unique? Or are there duplicates? – The Archetypal Paul Jul 06 '14 at 19:53
  • Do you really need the edit distance for every pair of records? If you only need the edit distance for records that are within a certain distance from one another, there are heuristics you can apply. – wingedsubmariner Jul 06 '14 at 20:04
  • There are some exact duplicates; the percentage is unknown. To reduce the number of computations, only records within a certain distance from one another may be of interest. – DarqMoth Jul 06 '14 at 20:36
  • Can you say more about what will be done with the results? I'm wondering if there are ways of not having to consider all the pairs – The Archetypal Paul Jul 06 '14 at 20:48
  • The goal is to reduce duplicates. For this, some edit-distance threshold will be determined. Records with `edit-distance < threshold` will be considered duplicates. – DarqMoth Jul 06 '14 at 20:54
  • So... can we discard one of a pair as soon as we discover edit-distance < threshold, and continue the comparisons with the other? Or might A + edit-distance B + edit-distance C be outside the threshold? How many edit-distance-duplicates are expected? – The Archetypal Paul Jul 06 '14 at 21:43
  • Some thoughts: do you have a way of determining which string of a similar pair is more "correct"? Can you cluster your string set? The problem is: let's say you have strings "aaa" "aab" "aba" "baa" "abb" "bab" "bba" "bbb". Which ones are "duplicates" and which of them should be deleted? Can you use a vocabulary for clustering? – enlait Jul 07 '14 at 05:47
  • _All comparison results I need to write into DB..._ Btw if you have for example strings of length 50 and store the distance as a long, you'll end up needing 240 petabytes (240 000 TB) of storage space. – enlait Jul 07 '14 at 05:57
  • @enlait and even if each such comparison record were just 1 byte long, it's still 4 petabytes, so it doesn't seem like various optimizations (like storing ids instead of the strings themselves) can provide a feasible solution – om-nom-nom Jul 07 '14 at 06:05
  • @Paul Yes, we can discard a pair as soon as we discover edit-distance < threshold, and continue the comparisons with the other. There is no a priori knowledge of how many edit-distance-duplicates are expected. And I am NOT building a transitive closure (http://en.wikipedia.org/wiki/Transitive_closure), so I do not assume that if (A similar B) and (B similar C) then (A similar C). I just need to know how well the edit-distance metric works with particular string fields in a particular DB. – DarqMoth Jul 07 '14 at 08:05
  • @enlait _let's say you have strings "aaa" "aab" "aba" "baa" "abb" "bab" "bba" "bbb". Which ones are "duplicates"_ is determined by the edit-distance threshold, which is an adjustable parameter, for example 1, 2 or 3 – DarqMoth Jul 07 '14 at 08:10
  • I'd still like to know what you're going to do with the results. As @om-nom-nom says, there are petabytes of storage needed for the current output... so it seems to me you can't afford to generate the lot. – The Archetypal Paul Jul 07 '14 at 09:39
  • @Paul This work should eventually result in a framework to test different metrics for approximate field matching (not only edit distance), such as these: https://github.com/rockymadden/stringmetric. I need to store comparison results to collect metric performance statistics and analyze them with other tools. – DarqMoth Jul 07 '14 at 09:57
  • Do you have the capacity, then? That's a big system, which makes me wonder why you seem to be planning to run this on a single machine. Petabytes of information, billions of comparisons, multiple runs. What does it mean to 'test different metrics', and do you need all those 2.4 billion results to do so, or would a sampling approach work? (Very curious about the problem domain here, but you may not be able to share that, I guess) – The Archetypal Paul Jul 07 '14 at 10:20
  • I need to do something quick first and then move this stuff to Hadoop. Getting a quick idea of metric quality for the data I have is important. Sampling may be a good idea for this, thanks for the insight! Do you have any ideas how the data could be sampled easily enough in this case? Do some random SQL selects with a limit imposed on the result? – DarqMoth Jul 07 '14 at 12:12
  • @DarqMoth moving this stuff to hadoop won't automatically solve your *huge data* problem, unless you're operating at google level. As for the sampling techniques, [there is a question for that](http://stackoverflow.com/q/249301/298389). – om-nom-nom Jul 07 '14 at 21:26
  • @om-nom-nom What do you mean by _operating at google level_? – DarqMoth Jul 08 '14 at 07:53
  • @DarqMoth I mean gigantic cluster of machines – om-nom-nom Jul 08 '14 at 08:49
  • @DarqMoth, I can't really answer how you can sample because you haven't really told us the properties of your input data set. Possibly the first (say) 1% is a sufficiently representative sample, possibly you need to take every 100th row, or something else. – The Archetypal Paul Jul 09 '14 at 11:50

0 Answers