Breaking up column of strings into multiple columns (no standard format)?

Question

I have an excel file which I have imported into R. The problem is that there are over 100,000 rows that have data that looks like the following in a single cell:

{'gender': 'Male', 'nationality': 'IRL', 'document_type': 'passport', 'date_of_expiry': '2019-08-12', 'issuing_country': 'IRL'}

So what happens is that R also places this entire information into one box in my table. I would like R to break this single column into multiple columns such that 'gender', 'nationality', 'document_type' etc. each have their own column with entries 'Male', 'IRL', 'passport' etc. respectively for this row. The problem is that some information is missed out in other cells e.g. another cell may look like this:

{'gender': 'Female', 'document_type': 'driving_licence', 'date_of_expiry': '2023-02-28', 'issuing_country': 'GBR'}

and another like this:

{'document_type': 'driving_licence', 'date_of_expiry': '2044-12-14', 'issuing_country': 'GRC'}

How do I assign these entries to their own columns and have empty values where no information is provided. I am a beginner R user. I have tried using package "qdapRegex" because they have a handy tool where I can extract characters between two other characters/strings but to no avail. I think if I could figure out how to apply operators like '|' to strings I would be able to do this with my way.

After Input:

input <- read.csv(text=r"(
ID, Properties
1,"{'gender': 'Male', 'nationality': 'IRL', 'document_type': 'passport', 'date_of_expiry': '2019-08-12', 'issuing_country': 'IRL'}"
2,"{'document_type': 'national_identity_card', 'date_of_expiry': '2027-01-23', 'issuing_country': 'CYP'}"
3,"{'date_of_expiry': '2019-01-28', 'issuing_country': 'JEY'}"
4,"{'document_type': 'driving_licence', 'date_of_expiry': '2006-06-14', 'issuing_country': 'POL'})")

Desired Output:

output <- read.csv(text="
ID,gender,nationality,document_type,date_of_expiry,issuing_country
1,Male,IRL,passport,2019-08-12,IRL
2,,,national_identity_card,2027-01-23,CYP
3,,,,2019-01-28,JEY
4,,,driving_licence,2006-06-14,POL
")

It looks like you have a column of JSON data. There are JSON parsers in R that can help. It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Jul 16 '21 at 18:37
That was much better, I turned the data into proper R objects. All those pipes and dashes still make it hard to copy/paste into R for testing. We're trying to prevent having to retype of bunch of stuff just to test the code. — MrFlick, Jul 16 '21 at 19:22

score 0 · Accepted Answer · answered Jul 16 '21 at 19:29

0

So your data is very JSON like. The only "problem" is that it uses single quotes rather than double quotes. If we swap the quotes, we can more easily parse the data. Here's some code using a few helper pacakges to get the job done

library(purrr)
library(dplyr)
library(jsonlite)
library(tidyr)

input %>% 
  mutate(Properties = gsub("'", "\"", Properties)) %>% 
  mutate(Properties = map(Properties, fromJSON)) %>% 
  unnest_wider(Properties)

#      ID gender nationality document_type          date_of_expiry issuing_country
#   <int> <chr>  <chr>       <chr>                  <chr>          <chr>          
# 1     1 Male   IRL         passport               2019-08-12     IRL            
# 2     2 NA     NA          national_identity_card 2027-01-23     CYP            
# 3     3 NA     NA          NA                     2019-01-28     JEY            
# 4     4 NA     NA          driving_licence        2006-06-14     POL

When we parse the JOSN we get a named list of values. The tidyr::unnest_wider function will turn that list of values into columns.

answered Jul 16 '21 at 19:29

MrFlick

195,160
17
277
295

Error: Problem with `mutate()` column `properties`. i `properties = map(properties, fromJSON)`. x lexical error: invalid char in json text. {"gender": None, "issuing_date": "2007-07- – sthurman Jul 16 '21 at 19:36
This is the error I am receiving, unfortunately I'm not too familiar with JOSN text but the error seems to be with the entry None I think and perhaps the fact it does not contain speech marks around it – sthurman Jul 16 '21 at 19:37
Yeah "None" is not a valid JSON token. Missing values should be "null". Where is this data coming from? It looks like these are raw python dict objects which is odd. I don't understand how that would wind up in an Excel document. – MrFlick Jul 16 '21 at 19:49
All I have is this excel file and I want to carry out some statistical analysis on it. Is it possible to replace every instance of none in this column with null and rerunning the code? – sthurman Jul 16 '21 at 19:54
That does seem possible. I suggest you try that. It's not easy for me to test without a reproducible example. – MrFlick Jul 16 '21 at 19:56

Breaking up column of strings into multiple columns (no standard format)?

1 Answers1