
In my previous question I asked for advice on an algorithm to compare all elements in a huge list: Scala: Compare all elements in a huge list

A more general problem I am facing, and would be grateful for advice on, is how to do an approximate comparison of list elements when the list does not fit into memory all at once. I am building the list from an SQL query that returns a cursor over a single string field of about 70 000 000 records. I need to find the edit distance (http://en.wikipedia.org/wiki/Edit_distance) between every two string elements in this list.
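For reference, here is a minimal two-row dynamic-programming sketch of the edit distance in Scala. This is just the textbook Levenshtein algorithm as an illustration, not code from the linked question:

```scala
// Standard Levenshtein distance with a two-row DP table, so memory stays
// proportional to one string length rather than to the full matrix.
def editDistance(a: String, b: String): Int = {
  val prev = Array.tabulate(b.length + 1)(j => j) // distances for the previous row
  val curr = new Array[Int](b.length + 1)         // distances for the current row
  for (i <- 1 to a.length) {
    curr(0) = i
    for (j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
    }
    Array.copy(curr, 0, prev, 0, curr.length)
  }
  prev(b.length)
}
```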

My idea here is to use a sliding window of N records to compare all 70 000 000 records (a rough sketch of this loop follows the numbered steps below):

  1. Read N elements into a list that fits nicely into memory (N ~ 10 000)
  2. Calculate the edit distance between all elements of this list using the algorithm described here: Scala: Compare all elements in a huge list
  3. Read the next N elements (records N to 2N-1) into a new list. Compare all of these as in step 2.
  4. Rewind the SQL cursor to the first record
  5. Compare every string from records 0 to N-1 with all strings in this new list, using the same algorithm as in step 2.
  6. Slide the window again and read strings from records 2N to 3N-1 into a new list
  7. Compare every string from records 0 to 2N-1 with all strings in this new list, using the same algorithm as in step 2.
  8. And so on.
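Here is a rough sketch of this driver loop in Scala, just to make the steps concrete. Everything in it is a placeholder: `fetchBatch` stands in for re-running the query with LIMIT/OFFSET (or rewinding the cursor), `editDistance` is the pairwise metric, and `store` is the DB write; none of these are existing functions.

```scala
object SlidingWindowSketch {

  // Hypothetical helper: fetch up to n strings starting at `offset`,
  // e.g. by re-issuing the query with LIMIT n OFFSET offset.
  def fetchBatch(offset: Long, n: Int): IndexedSeq[String] = ???

  // The pairwise metric, e.g. the two-row Levenshtein shown earlier.
  def editDistance(a: String, b: String): Int = ???

  // Placeholder for writing one (String, String, Distance) record to the DB.
  def store(a: String, b: String, d: Int): Unit = ()

  def run(total: Long, n: Int): Unit = {
    var windowStart = 0L
    while (windowStart < total) {
      val window = fetchBatch(windowStart, n)

      // Step 2: all pairs inside the current window.
      for (i <- window.indices; j <- i + 1 until window.size)
        store(window(i), window(j), editDistance(window(i), window(j)))

      // Steps 4-7: every string from the earlier windows against the current one.
      var prevStart = 0L
      while (prevStart < windowStart) {
        val earlier = fetchBatch(prevStart, n)
        for (a <- earlier; b <- window)
          store(a, b, editDistance(a, b))
        prevStart += n
      }

      windowStart += n
    }
  }
}
```

Since `window` and `earlier` are local vals, each batch becomes unreachable as soon as the next one replaces it.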

I need to write all comparison results into the DB as (String, String, Distance) records, where the first two elements are the strings being compared and the third is the resulting distance.
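One possible way to model a result row before it is written out (the name is illustrative only; the actual DB write, e.g. a JDBC batch insert, is not shown):

```scala
// Hypothetical shape of one comparison result row.
case class ComparisonResult(left: String, right: String, distance: Int)
```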

Questions:

  1. How can I make Scala garbage-collect the lists from previous steps of this algorithm that are no longer needed?
  2. This algorithm is awful in terms of the number of calculations required to do the job. Are there other algorithms or ideas for reducing the complexity?

Thanks!

  • As written, there's nothing much you can do. If you need all pair-wise comparisons, then you need to do 2.4 _billion_ comparisons and end up with a list of 2.4 billion elements. I find it unlikely you're going to do anything else with that many elements, so what happens next? What are you going to do with the results of this comparison? – The Archetypal Paul Jul 06 '14 at 19:48
  • Are all the strings unique? Or are there duplicates? – The Archetypal Paul Jul 06 '14 at 19:53
  • Do you really need the edit distance for every pair of records? If you only need the edit distance for records that are within a certain distance from one another, there are heuristics you can apply. – wingedsubmariner Jul 06 '14 at 20:04
  • There are some exact duplicates; the percentage is unknown. To reduce the number of computations, only records within a certain distance from one another may be of interest. – DarqMoth Jul 06 '14 at 20:36
  • Can you say more about what will be done with the results? I'm wondering if there are ways of not having to consider all the pairs – The Archetypal Paul Jul 06 '14 at 20:48
  • The goal is to reduce duplicates. For this, some edit-distance threshold will be determined. Records with `edit-distance < threshold` will be considered duplicates. – DarqMoth Jul 06 '14 at 20:54
  • So... can we discard one of a pair as soon as we discover edit-distance < threshold, and continue the comparisons with the other? Or might A + edit-distance B + edit-distance C be outside the threshold? How many edit-distance-duplicates are expected? – The Archetypal Paul Jul 06 '14 at 21:43
  • Some thoughts: do you have a way of determining which string of a similar pair is more "correct"? Can you cluster your string set? The problem is: let's say you have strings "aaa" "aab" "aba" "baa" "abb" "bab" "bba" "bbb". Which ones are "duplicates" and which of them should be deleted? Can you use a vocabulary for clustering? – enlait Jul 07 '14 at 05:47
  • _All comparison results I need to write into DB..._ Btw if you have for example strings of length 50 and store the distance as a long, you'll end up needing 240 petabytes (240 000 TB) of storage space. – enlait Jul 07 '14 at 05:57
  • @enlait and even if each such comparison record were just 1 byte long, it's still 4 petabytes, so it doesn't seem like various optimizations (like storing ids instead of the strings themselves) can provide a feasible solution – om-nom-nom Jul 07 '14 at 06:05
  • @Paul Yes, we can discard a pair as soon as we discover edit-distance < threshold, and continue the comparisons with the other. There is no a priori knowledge of how many edit-distance-duplicates are expected. And I am NOT building a transitive closure (http://en.wikipedia.org/wiki/Transitive_closure), so I do not assume that if (A similar B) and (B similar C) then (A similar C). I just need to know how well the edit-distance metric works with particular string fields in a particular DB. – DarqMoth Jul 07 '14 at 08:05
  • @enlait _let's say you have strings "aaa" "aab" "aba" "baa" "abb" "bab" "bba" "bbb". Which ones are "duplicates"_ is determined by the edit-distance threshold, which is an adjustable parameter, for example 1, 2 or 3 – DarqMoth Jul 07 '14 at 08:10
  • I'd still like to know what you're going to do with the results. As @om-nom-nom says, there are petabytes of storage needed for the current output... so it seems to me you can't afford to generate the lot. – The Archetypal Paul Jul 07 '14 at 09:39
  • @Paul This work should eventually result in a framework to test different metrics for approximate field matching (not only edit distance), such as these: https://github.com/rockymadden/stringmetric. I need to store comparison results to collect metric performance statistics and analyze them with other tools. – DarqMoth Jul 07 '14 at 09:57
  • Do you have the capacity, then? That's a big system, which makes me wonder why you seem to be planning to run this on a single machine. Petabytes of information, billions of comparisons, multiple runs. What does it mean to 'test different metrics', and do you need all those 2.4 billion results to do so, or would a sampling approach work? (Very curious about the problem domain here, but you may not be able to share that, I guess) – The Archetypal Paul Jul 07 '14 at 10:20
  • I need to do something quick first and then move this stuff to Hadoop. Getting a quick idea of metric quality for the data I have is important. Sampling may be a good idea for this, thanks for the insight! Do you have any ideas how the data could be sampled easily enough in this case? Do some random SQL selects with a limit imposed on the result? – DarqMoth Jul 07 '14 at 12:12
  • @DarqMoth moving this stuff to hadoop won't automatically solve your *huge data* problem, unless you're operating at google level. As for the sampling techniques, [there is a question for that](http://stackoverflow.com/q/249301/298389). – om-nom-nom Jul 07 '14 at 21:26
  • @om-nom-nom What do you mean by _operating at google level_? – DarqMoth Jul 08 '14 at 07:53
  • @DarqMoth I mean gigantic cluster of machines – om-nom-nom Jul 08 '14 at 08:49
  • @DarqMoth, I can't really answer how you can sample because you haven't really told us the properties of your input data set. Possibly the first (say) 1% is a sufficiently representative sample, possibly you need to take every 100th row, or something else. – The Archetypal Paul Jul 09 '14 at 11:50

0 Answers