I have a column with 100,000+ strings in it. I wish to have Google Refine replace these strings with their Fingerprint.

I selected the column in Google Refine, and created a Text Facet. From that Text Facet I can select "Cluster". This will show me the clusters, which I assume to mean string values that have the same fingerprint, and allow me to select a New Cell Value, which defaults to the name of the first member of the cluster.

I wish for this name to just be the fingerprint. The reason is, I need to do this operation to multiple files and I need them to be the same value if they are indeed part of the same cluster. I cannot concatenate the files, as this results in too much data for Refine to handle, despite optimizing the memory parameters as per the Refine FAQ.

So I am simply looking for an operation that takes each cell in a column, calculates its Fingerprint, and replaces the value in the column with its Fingerprint.

I am using Google Refine 2.5 on OSX 10.7

Brian Feeny

1 Answer

Text facets with thousands of choices are going to bog down your browser. If you're only using the facet as a means to access clustering, you can get to the same functionality by using Edit cells -> Cluster and edit.

To compute the fingerprint, use the aptly named fingerprint function, i.e. value.fingerprint(). You can apply it in place with Edit cells -> Transform, although I'd recommend adding a new column (Edit column -> Add column based on this column) rather than overwriting your original values, in case you find you need them again.
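
If it helps to see what the fingerprint key roughly does, or to compute comparable keys outside Refine for files too large to load, here is a minimal Python sketch. It is an approximation based on the documented steps of the fingerprint method (trim, lowercase, strip punctuation, fold accents to ASCII, then sort and de-duplicate the whitespace-separated tokens), not Refine's actual code. The function name is my own, and the output is not guaranteed to match value.fingerprint() for every input, so don't mix keys from the two sources without checking.

```python
import string
import unicodedata


def fingerprint(value: str) -> str:
    """Approximate a Refine-style fingerprint key (illustrative, not Refine's code)."""
    # Trim surrounding whitespace and lowercase.
    s = value.strip().lower()
    # Decompose accented characters and drop the combining marks (e.g. e-acute -> e).
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    s = "".join(ch for ch in s if ch not in string.punctuation)
    # Split on whitespace, de-duplicate, sort, and rejoin with single spaces.
    return " ".join(sorted(set(s.split())))


if __name__ == "__main__":
    for raw in ["  Smith, John ", "John Smith", "Café café"]:
        print(repr(raw), "->", repr(fingerprint(raw)))
```

Because the key depends only on the cell's own text, the same string gets the same key in every file, which is what you need when the files can't be concatenated.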

Tom Morris