Suppose I am given a set of structured data. The data is known to be problematic, and I need to somehow "score" them on consistency. For example, I have the data as shown below:
fieldA | fieldB | fieldC
-------+--------+-------
foo | bar | baz
fooo | bar | baz
foo | bar | lorem
.. | .. | ..
lorem | ipsum | dolor
lorem | upsum | dolor
lorem | ipsum | baz
So assume the first row is considered the correct entry because there are relatively more data in that combination compared to the records in second and third row. In the second row, the value for fieldA
should be foo
(inconsistent due to misspelling). Then in the third row, the value of fieldC
should be baz
as other entries in the dataset with similar values for fieldA
(foo
) and fieldB
(bar
) suggest.
Also, in other part of the dataset, there's another combination that is relatively more common (lorem
, ipsum
, dolor
). So the problem in the following records are the same as the one mentioned before, just that the value combination is different.
I initially dumped everything to a SQL database, and use statements with GROUP BY
to check consistency of fields values. So there will be 1 query for each field I want to check for consistency, and for each record.
SELECT fieldA, count(fieldA)
FROM cache
WHERE fieldB = 'bar' and fieldC = 'baz'
GROUP BY fieldA
Then I could check if the value of fieldA
of a record is consistent with the rest by referring the record to the object below (processed result of the previous SQL query).
{'foo': {'consistency': 0.99, 'count': 99, 'total': 100}
'fooo': {'consistency': 0.01, 'count': 1, 'total': 100}}
However it was very slow (dataset has about 2.2million records, and I am checking 4 fields, so making about 9mil queries), and would take half a day to complete. Then I replaced SQL storage to elasticsearch, and the processing time shrunk to about 5 hours, can it be made somehow faster?
Also just out of curiosity, am I re-inventing a wheel here? Is there an existing tool for this? Currently it is implemented in Python3, with elasticsearch.