OpenRefine: select value based on a variable another column

Question

I have a problem with OpenRefine. I am adding a new column based on a url and from there calling an API for getting some terms from a controlled vocabulary (AAT). I parse the results and I obtain a multivalued cells such as:

http://vocab.getty.edu/aat/300041366||aquatints (prints)::http://vocab.getty.edu/aat/300053242||aquatint (printing process)::http://vocab.getty.edu/aat/300191265||dust bags::http://vocab.getty.edu/aat/300191278||dust boxes::http://vocab.getty.edu/aat/300191278||dust boxes::http://vocab.getty.edu/aat/300191278||dust boxes::http://vocab.getty.edu/aat/300249564||aquatinters::http://vocab.getty.edu/aat/300249564||aquatinters::http://vocab.getty.edu/aat/300249564||aquatinters::http://vocab.getty.edu/aat/300249564||aquatinters::http://vocab.getty.edu/aat/300053242||aquatint (printing process)::http://vocab.getty.edu/aat/300041366||aquatints (prints)::http://vocab.getty.edu/aat/300041368||sandpaper aquatints::http://vocab.getty.edu/aat/300041368||sandpaper aquatints

Where I have the current structure:

URI||Corresponding_TERM::URI||Corresponding_TERM

etc.

I now need to choose one of those records. My solution is to use
value.split("::")[0]
in order to choose which element I want.
Unfortunately this solutions has very evident drawbacks, because the order of the elements in the array is not constant, so if the first element [0] would be the right one for one record, it would probably not be for the next one.

For explain myself better, I now have this kind of structure

-----------------------------------------------------------
|ID | Classification | Term_From_Thesaurus                |  
| 1 | Aquatints      | uri||term1::uri||term2::           | 
| 1 | Aquatints      | uri||term1::uri||term2::           | 
| 2 | Drypoints      | uri||term3::uri||term4::           |
| 3 | Woodcut        | uri||term5::uri||term6::uri||term7 |
-----------------------------------------------------------

And I need to associate term1 with Aquatints, term 4 with Drypoints and term 7 with Woodcut.

How can I do that? A solution could be using facet and a lot of manual work, but maybe there is a better one? What about going to each record and if ID = 1 they should use term1, if ID=2 should use term 4 etc. Would it be possible? I sincerely do not know how to use the value of another column as variable to determine the result of an operation. cell.cross would help, but in case I need to split the data into two files, and doesn’t seems to me a proper solution..

I don't understand how you determine that Aquatints corresponds to term 2 (why not to term1 ?) or Woodcuts to term 7. What is the logic? Simply that term 7 looks like "woodcuts" or contains that word? — Ettore Rizza, May 21 '17 at 21:55
Hi, I would have to define it myself based on my knowledge of the data/field. It is just a process of term normalisation based on some API because I cannot use a proper reconcile and matching service. Any other ideas/strategy are welcome! :-) — K3it4r0, May 21 '17 at 22:16

score 2 · Accepted Answer · answered May 22 '17 at 09:16

So I'm not sure if I've understood your question correctly, but it is possible to "select value based on a variable in another column".

If you have:

-----------------------------------------------------------
|ID | Classification | Term_From_Thesaurus                |  
| 1 | Aquatints      | uri||term1::uri||term2::           | 
| 1 | Aquatints      | uri||term1::uri||term2::           | 
| 2 | Drypoints      | uri||term3::uri||term4::           |
| 3 | Woodcut        | uri||term5::uri||term6::uri||term7 |
-----------------------------------------------------------

Then if you split the 'Term_From_Thesaurus' column into an array, then you can use the number in the 'ID' column to select the relevant entry in the array. However, note that for this to work you need to have the number in the ID column to be converted into a Number type (if it isn't already). In this example I'll assume that the number in the ID column starts off as a String rather than Number.

So the formula:

value.split("::")[cells.ID.value.toNumber()-1]

Will find the first value in the first and second row, the second value in the third row and the 4th item in the 4th row. This is illustrated here:

The formula breaks down as follows:

value.split("::") = splits the list of URI/Term pairs into an array
cells.ID.value.toNumber() = converts the value in the ID column into a number type
-1 = because array members are counted from zero

Hope this is clear

score 1 · Answer 2 · answered May 21 '17 at 22:26

1

If you have to choose the right term yourself according to your knowledge of the field, I do not see how to automate the operation. Here is a solution to match each term in your "Classification" column with the one that most closely resembles to it amongst the terms returned by the API.

The Grel formula used is :

value.fingerprint() == cells.terms.value.replace(/\(.+\)/,'').fingerprint()

answered May 21 '17 at 22:26

Ettore Rizza

2,800
2
11
23

Thanks, this could help quite a lot. I knew that my question would probably not find an easy answer, but I tried :) – K3it4r0 May 21 '17 at 23:02
I don't know what API you used, but it returns too many results. Have-you tried to play with the Getty Sparql Endpoint ? I know that the Deri RDF extension doesn't work with this endpoint, but a Sparql request is fundamentaly the same as an API call in Open Refine : you have to construct an URL, fetch it, and parse the Json. – Ettore Rizza May 22 '17 at 05:35
(sorry for the delay) I actually tried but with not success. Going to try that again and in case ask more details in the AAT google group :) Thanks for the tip! – K3it4r0 May 30 '17 at 07:45

OpenRefine: select value based on a variable another column

2 Answers2