2

I've got two datasets, with common column name that are just slightly different.

Dataset A:

Name,Number
Hartlepool UA,306
Tyne and Wear Met County,683

Dataset B:

Name,Number
Hartlepool,366
Tyne and Wear,353

I thought Google Refine / Open Refine was designed to help reconcile this sort of messy data in a fuzzy way, but it seems not.

Is there any tool I can use to match these in a fuzzy way, without having to write code? I can write code easily enough if needed, but I'm surprised that Google Refine does not seem to do what I though it did.

Community
  • 1
  • 1
Richard
  • 62,943
  • 126
  • 334
  • 542

3 Answers3

2

You could also just:

  1. Strip all punctuation and consider the address cell as one big long string, if it's not already in one cell, you need to use some join()'s. Once it is a big string in a single cell, then use the Facet -> Custom Text Facet -> fingerprint() GREL function.

    fingerprint(value)
    value.fingerprint()
    
  2. Using that big long string, you can also perform your fuzzy analysis with Clustering or with a Custom Text Facet and

    value.ngramFingerprint(2)
    

Those are probably still not the best solutions for your type of messy data.... Addresses. especially, if your considering your 2 examples to be "Same As", which they are not. And so you will need to take the approach that Tom suggests.

But if your really trying to say...hey, most of the words and numbers in these 2 address are the same, and I think they are pretty close in my context, and want to measure in some way the closeness of the 2 strings fingerprints... then the above 2 steps should get you pointed in the right direction.

Thad Guidry
  • 579
  • 4
  • 8
2

Maybe this [new?] reconciliation service implementation could work and make it easier? http://okfnlabs.org/reconcile-csv/

Reconcile-csv is a reconciliation service for OpenRefine running from a CSV file. It uses fuzzy matching to match entries in one dataset to entries in another dataset, helping to introduce unique IDs into the system - so they can be used to join your data painlessly.

lmsurprenant
  • 1,723
  • 2
  • 14
  • 28
1

There are two strategies which could work in OpenRefine:

  1. Reconcile against Freebase, DBpedia (using the RDF extension), or some other source of strong identifiers and then matching using the identifiers.

  2. Add a column to each dataset with its source (A or B), rearrange the columns to be in the same order, import the two datasets into a single Refine project and then using the clustering to find the duplicates.

Tom Morris
  • 10,490
  • 32
  • 53