How to do a for loop in openrefine

Question

I'm using OpenRefine to manage some data. I have a column named "Country" that has countries around the world. However,some values are some states names of U.S instead of "United States". I want to change them to a single value "United States".

This is my idea below. I'm expressing it in a python-ish way. Basically I want to create a loop that evaluate if the each value in the column equals to any of the state name, and when it's true, replace it to the value "United States".

list = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']

for i in list:
   if cells["Location 2”].value = i:
      value.replace(cells["Location 2”].value, i)

Any Idea how to do it with syntax that works on the OpenRefine custom text transform feature?

score 1 · Answer 1 · answered Nov 04 '22 at 18:32

There are several solutions to your problem. I would choose solution 2 or 3 depending on your Python skills.

Solution 1: Python/Jython

If you are familiar with Python you could change the expression language to Python/Jython and then just write Python 2.7 code.

Following your example the code for the transformation dialog would look like:

list = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']

for i in list:
   if value == i:
      return "United States"
return value

Solution 2: Python/Jython using set

An improved version would be to use a set instead of a list and skip the for loop:

states = set(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'])

return "United States" if value in states else value

Solution 3: GREL

If you want to use GREL instead of Python/Jython the solution would look like the following code snippet using an if-control and array functionality from GREL.

if(
    ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'].inArray(value),
    "United States",
    value
)

Solution 4: Text Facet

In cases where you do not have list of the values to extract available you could use the text facet functionality from OpenRefine to filter the rows you want and then use the text transformation dialog with a fixed value like "United States". Note the quotes that tells the GREL interpreter to use the string "United States" instead of trying to interpret United States as GREL function.

Solution 5: Reconciliation

Sometimes you have quite mixed data but with a normalized vocabulary in a column. One trick so separate such a column is to use reconciliation against a data source like Wikidata that already has the data sorted into categories/types/.... After the reconciliation process you can then load the category/type/... as additinonal column that can be used to filter the data.