
Suppose I am given a set of structured data. The data is known to be problematic, and I need to somehow "score" the records on consistency. For example, I have data like the following:

fieldA | fieldB | fieldC
-------+--------+-------
foo    | bar    | baz
fooo   | bar    | baz
foo    | bar    | lorem
..     | ..     | ..
lorem  | ipsum  | dolor
lorem  | upsum  | dolor
lorem  | ipsum  | baz

Assume the first row is considered the correct entry because that combination of values occurs relatively more often than the ones in the second and third rows. In the second row, the value of fieldA should be foo (the entry is inconsistent due to a misspelling). In the third row, the value of fieldC should be baz, as other entries in the dataset with similar values for fieldA (foo) and fieldB (bar) suggest.

Also, in another part of the dataset there is a different combination that is relatively more common (lorem, ipsum, dolor). The problem in the records that follow it is the same as the one described above; only the value combination differs.

I initially dumped everything into a SQL database and used GROUP BY statements to check the consistency of field values, so there is one query per field I want to check, for every record. For example, to check fieldA:

SELECT    fieldA, count(fieldA)
FROM      cache
WHERE     fieldB = 'bar' and fieldC = 'baz'
GROUP BY  fieldA
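
In Python, the driver for these queries looks roughly like the sketch below (a minimal illustration assuming a sqlite3 connection to the same cache table; the helper name field_counts is mine):

import sqlite3

conn = sqlite3.connect("cache.db")  # assumed database file

def field_counts(record, field, other_fields):
    """For one record, count how often each value of `field` occurs among
    rows that share this record's values in the other fields."""
    where = " AND ".join("{} = ?".format(f) for f in other_fields)
    query = "SELECT {0}, COUNT({0}) FROM cache WHERE {1} GROUP BY {0}".format(field, where)
    params = [record[f] for f in other_fields]
    return conn.execute(query, params).fetchall()

# e.g. field_counts({'fieldB': 'bar', 'fieldC': 'baz'}, 'fieldA', ['fieldB', 'fieldC'])
# -> [('foo', 99), ('fooo', 1)]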

Then I can check whether the value of fieldA in a record is consistent with the rest by looking it up in the object below (the processed result of the previous SQL query):

{'foo': {'consistency': 0.99, 'count': 99, 'total': 100},
 'fooo': {'consistency': 0.01, 'count': 1, 'total': 100}}
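
Building that object from the GROUP BY rows is a small aggregation step, roughly like this (an illustrative sketch, not the exact code I use):

def consistency_stats(rows):
    """rows: (value, count) pairs as returned by the GROUP BY query."""
    total = sum(count for _, count in rows)
    return {
        value: {'consistency': count / total, 'count': count, 'total': total}
        for value, count in rows
    }

# consistency_stats([('foo', 99), ('fooo', 1)])
# -> {'foo': {'consistency': 0.99, 'count': 99, 'total': 100},
#     'fooo': {'consistency': 0.01, 'count': 1, 'total': 100}}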

However, this was very slow: the dataset has about 2.2 million records and I am checking 4 fields, so roughly 9 million queries are made, which took half a day to complete. I then replaced the SQL storage with Elasticsearch, and the processing time shrank to about 5 hours. Can it somehow be made faster?

Also, just out of curiosity: am I reinventing the wheel here? Is there an existing tool for this? Currently it is implemented in Python 3 with Elasticsearch.

Jeffrey04

1 Answer


I just read your question and found it quite interesting. I did something similar using nltk (the Python Natural Language Toolkit). Anyway, in this case I think you don't need sophisticated string comparison algorithms.

So I tried an approach using Python's difflib. The title sounds promising: difflib — Helpers for computing deltas.

The documentation for the difflib.SequenceMatcher class says:

This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.
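
A quick illustration of what SequenceMatcher.ratio() returns for strings like the ones in your question:

import difflib

# ratio() is 1.0 for identical strings and drops towards 0.0 as they diverge
print(difflib.SequenceMatcher(None, 'foo', 'fooo').ratio())     # ~0.857
print(difflib.SequenceMatcher(None, 'ipsum', 'upsum').ratio())  # 0.8
print(difflib.SequenceMatcher(None, 'foo', 'lorem').ratio())    # 0.25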

By the way, I think that if you want to save time you could easily hold and process 2,000,000 3-tuples of (relatively short) strings in memory (see the test runs and memory usage below).

So I wrote a demo app that produces 2,000,000 (you can vary that) 3-tuples of randomly, slightly shuffled strings. The shuffled strings are based on and compared against a default pattern like yours: ['foofoo', 'bar', 'lorem']. It then compares them using difflib.SequenceMatcher, all in memory.
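
The generation part is only there to produce test data; a rough sketch of what the demo does (my paraphrase here, the exact shuffling logic is in the gist):

import random
import string

PATTERN = ['foofoo', 'bar', 'lorem']

def shuffle_string(s, probability=0.3):
    """With the given probability, replace one random character of s."""
    if random.random() < probability:
        i = random.randrange(len(s))
        s = s[:i] + random.choice(string.ascii_lowercase) + s[i + 1:]
    return s

def build(num_rows):
    """Produce num_rows 3-tuples of slightly shuffled copies of PATTERN."""
    return [tuple(shuffle_string(word) for word in PATTERN) for _ in range(num_rows)]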

Here is the compare code:

import collections
import difflib

def compare(intuple, pattern_list):
    """
    Compare the strings in a tuple against a pattern list with difflib.

    intuple: an n-tuple of strings
    pattern_list: the reference pattern; must have the same length as intuple

    Returns an OrderedDict with the tuple and the per-field scores.
    """
    d = collections.OrderedDict()
    d["tuple"] = intuple
    #d["pattern"] = pattern_list
    scorelist = []
    for value, pattern in zip(intuple, pattern_list):
        # ratio() yields 1.0 for identical strings, lower values for worse matches
        score = difflib.SequenceMatcher(None, value.lower(), pattern.lower()).ratio()
        scorelist.append(score)
    d["score"] = scorelist
    return d
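
Running it over all tuples is then just a loop (illustrative; build() is the sketch from above, the real demo uses its own generator):

pattern = ['foofoo', 'bar', 'lorem']
rows = build(2000000)                      # or your 2.2 million records as 3-tuples
results = [compare(row, pattern) for row in rows]
print(results[0])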

Here are the runtime and memory usage results:

2,000 3-tuples: compare time: 417 ms = 0.417 s; memory usage: 594 KiB

200,000 3-tuples: compare time: 5,360 ms = 5.36 s; memory usage: 58 MiB

2,000,000 3-tuples: compare time: 462,241 ms ≈ 462 s; memory usage: 580 MiB

So it scales linearly in both time and memory usage, and it needs (only) 462 seconds to compare 2,000,000 string 3-tuples.

The result looks like this (example output for 2,000,000 rows):

[ TIMIMG ]
build                function took 53304.028034 ms
[ TIMIMG ]
compare_all          function took 462241.254807 ms
[ INFO ]

num rows: 2000000
pattern: ['foofoo', 'bar', 'lorem']
[ SHOWING 10 random results ]
0: {"tuple": ["foofoo", "bar", "ewrem"], "score": [1.0, 1.0, 0.6]}
1: {"tuple": ["hoofoo", "kar", "lorem"], "score": [0.8333333333333334, 0.6666666666666666, 1.0]}
2: {"tuple": ["imofoo", "bar", "lorem"], "score": [0.6666666666666666, 1.0, 1.0]}
3: {"tuple": ["foofoo", "bar", "lorem"], "score": [1.0, 1.0, 1.0]}
....

As you can see, you get a score based on the similarity of each string to the pattern: 1.0 means the strings are equal, and the lower the score, the worse the match.

difflib is known not to be the fastest algorithm for this, but I think 7 minutes is quite an improvement over half a day or 5 hours.

I hope this helps you (and is not a complete misunderstanding), but it was a lot of fun to program yesterday, and I learned a lot. ;) For example, how to track memory usage using tracemalloc; I had never done that before.
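
For reference, the memory numbers above came from tracemalloc; the basic usage pattern looks like this (a minimal sketch, not the exact code from the gist):

import tracemalloc

tracemalloc.start()
results = [compare(row, pattern) for row in rows]
current, peak = tracemalloc.get_traced_memory()   # both values are in bytes
print("current: {} KiB / peak: {} KiB".format(current // 1024, peak // 1024))
tracemalloc.stop()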

I dropped the code on GitHub (as a one-file gist).

klaas
  • I haven't got time to look at the solution yet; can I use it to "score" entries with multiple terms, e.g. "foo bar" vs. "fooz bar"? – Jeffrey04 Apr 17 '16 at 09:11
  • That should work as well. difflib uses hashes to compare, so anything hashable will work. – klaas Apr 17 '16 at 10:33
  • lol, doesn't look like the tool I need, because I don't have all the possible (relatively) correct canonical values and combinations for each field beforehand. – Jeffrey04 Apr 18 '16 at 03:00
  • You can do the comparison exactly when you would otherwise run the compare in the DB, just as you stated in your question. As I understand it, you have 2.2 million records and a compare pattern; that's fine. Just read everything into memory and run the compare function. There is no need to have anything beforehand. – klaas Apr 18 '16 at 07:29
  • Not sure if it is relevant, but I have updated the question with another example. – Jeffrey04 Apr 19 '16 at 07:55
  • Do you compare those to another pattern, or also to foo, bar, baz? You can just give it a try: select all your rows, put them in tuples (foo, bar, baz)..., make a loop and call compare(tuple, pattern_list); that will return the dict with the scores per call. See the compare function above. – klaas Apr 19 '16 at 10:58
  • So currently all records are processed in a for loop (multiprocessing TBE); for each record I fire the SQL query shown in the question to get statistics on the possible values of each field, then score the respective field accordingly. – Jeffrey04 Apr 19 '16 at 11:24
  • So I don't have a known list of patterns, and the pattern actually applies only to the dataset itself. – Jeffrey04 Apr 19 '16 at 11:25