0

I have a dataframe created like below, with countries in JSON format:

df = pd.DataFrame([['matt', '''[{"c_id": "cn", "c_name": "China"}, {"c_id": "au", "c_name": "Australia"}]'''],
                   ['david', '''[{"c_id": "jp", "c_name": "Japan"}, {"c_id": "cn", "c_name": "China"},{"c_id": "au", "c_name": "Australia"}]'''],
                   ['john', '''[{"c_id": "br", "c_name": "Brazil"}, {"c_id": "ag", "c_name": "Argentina"}]''']],
                  columns =['person','countries']) 

I'd like to have the output as below, with just the country names, separated by a comma and sorted in alphabetical order:

result = pd.DataFrame([['matt', 'Australia, China'],
                   ['david', 'Australia, China, Japan'],
                   ['john', 'Argentina, Brazil']],
                  columns =['person','countries']) 

I tried doing this using a few methods, but none worked successfully. I was hoping the below would split the JSON format appropriately, but it didn't work out - perhaps because the JSONs are in string format in the dataframe?

result = pd.io.json.json_normalize(df, 'c_name')
Programmer
  • 1,266
  • 5
  • 23
  • 44
  • 1
    it's not valid JSON - ie, no `"` around `c_id` key values... `cn`, `au` etc, so it won't parse – Chris Adams Mar 12 '20 at 11:21
  • @ChrisA Sorry that was my mistake, I edited it. – Programmer Mar 12 '20 at 11:26
  • Maybe try `df['countries'] = [', '.join([d.get('c_name') for d in l]) for l in df['countries'].apply(json.loads)]` – Chris Adams Mar 12 '20 at 11:30
  • @ChrisA seems to work for this example, but in my real dataset I get `JSONDecodeError: Expecting value: line 1 column 3102 (char 3101)` – Programmer Mar 12 '20 at 11:35
  • again, must be invalid JSON - you've got a `"key" : ` with out an associated `"value"` somewhere around column 3102 – Chris Adams Mar 12 '20 at 11:37
  • Ah, is there a way to fix that? I already had to use `df.replace({'\'': '"'}, regex=True)` to replace all single quotes with double quotes, as I believe JSON mandates that? – Programmer Mar 12 '20 at 11:39
  • That's right yeah. I don't know of any other way to fix it other than more regex replacing unfortunately. If there is a better way, someone on here will know. Maybe [this](https://stackoverflow.com/questions/18514910/how-do-i-automatically-fix-an-invalid-json-string) post could help? – Chris Adams Mar 12 '20 at 11:42

1 Answers1

1

One solution could be to use ast.literal_eval to treat the string as a list of dictionaries:

import ast

df["countries"] = df["countries"].map(lambda x: ast.literal_eval(x))
df["countries"] = df["countries"].map(lambda x: sorted([c["c_name"] for c in x]))
datapug
  • 2,261
  • 1
  • 17
  • 33