Given this object:

 {
    "script": "Georgian",
    "id": 7,
    "script_family": "European",
    "direction": "LTR",
    "num_languages": 11,
    "type": "Alphabet",
    "date": 500,
    "Continent": ""
  },
  {
    "script": "Armenian",
    "id": 8,
    "script_family": "European",
    "direction": "RTL",
    "num_languages": 1,
    "type": "Alphabet",
    "date": 500,
    "Continent": ""
  },
  {
    "script": "Tamil",
    "id": 9,
    "script_family": "Indic",
    "direction": "LTR",
    "num_languages": 6,
    "type": "Syllabary",
    "date": 800,
    "Continent": ""
  },
  {
    "script": "Tibetan",
    "id": 10,
    "script_family": "Central Asian",
    "direction": "LTR",
    "num_languages": 45,
    "type": "Abugida",
    "date": 800,
    "Continent": ""
  },
  {
    "script": "Khmer",
    "id": 11,
    "script_family": "Mainland Southeast Asian",
    "direction": "LTR",
    "num_languages": 3,
    "type": "Abugida",
    "date": 900,
    "Continent": ""
  },

I want to build an array of objects, grouped by date, where each object holds the number of scripts in each script family that appears at that date, like this:

data = [
{date: 500, European: 2},
{date: 800, Indic: 1, "Central Asian": 1},
...
]

A single date can have multiple script families.

I tried `family = data.groupby(['date', 'script_family'])['script_family'].count()`, but when I export the result as a CSV I get one row per date and script family with its count. What I want is one row per date, with each script family that appears at that date set to its number of scripts.

date   script_family           
-400   European                    1
-300   East Asian                  1
-200   Middle Eastern              1
-100   European                    1
 500   African                     1
       European                    2
 600   Middle Eastern              1
 800   Central Asian               1
       Indic                       1
 900   East Asian                  1
       European                    1
       Indic                       3
       Mainland Southeast Asian    1
 1000  Indic                       1
 1100  Indic                       2
       Mainland Southeast Asian    1
 1200  Indic                       1
 1300  Central Asian               1
       Mainland Southeast Asian    1
...

1 Answer

Works in Python 2.7.18 and 3.9.1:

from collections import Counter
from itertools import groupby
from operator import itemgetter

data = ...  # load the data
data = sorted(data, key=itemgetter('date'))  # groupby needs sorted data

results = [
    dict(
        date=date,
        # count how many scripts of each family share this date
        **Counter(map(itemgetter('script_family'), dated_scripts))
    )
    for date, dated_scripts in groupby(data, key=itemgetter('date'))
]
print(results)
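The snippet above expects `data` to be a list of dicts. If the rows come from a CSV file instead of JSON, one way to get that shape with the standard library is `csv.DictReader`. This is a minimal sketch: the inline `csv_text` is a hypothetical stand-in for the real file, and the column names are assumed to match the keys in the question.

```python
import csv
import io

# Hypothetical CSV text standing in for the real file;
# column names are assumed to match the keys in the question.
csv_text = """script,script_family,date
Georgian,European,500
Armenian,European,500
Tamil,Indic,800
"""

# DictReader yields one dict per row. Every value is read as a string,
# so convert 'date' to int to keep the sort numeric.
data = [
    dict(row, date=int(row["date"]))
    for row in csv.DictReader(io.StringIO(csv_text))
]
print(data[0]["date"])
```

With a real file, `io.StringIO(csv_text)` would be replaced by an open file object.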

duthils
  • Thanks for answering! I keep getting the error `TypeError: string indices must be integers` even though 'date' is an int. Do you know what I am doing wrong? – Maxene Graze Feb 12 '21 at 16:28
  • Which line raises the error? Did you `json.load` the data? – duthils Feb 13 '21 at 16:47
  • I loaded it from a CSV here: https://www.kaggle.com/maxenegraze/notebookda9aef5c4a and the error first appears at the `data = sorted(...)` line. – Maxene Graze Feb 13 '21 at 22:11
  • The error says that `data` is a list of strings, or a dict of strings, or that one of the items in data is a string. The snippet above expects that `data` is a list of dicts, like the example input you provided. `data[0]['date']` should return the date of the first script from `data` – duthils Feb 14 '21 at 01:37
  • Ok understood. Is it possible to do this transformation with a data frame instead and would you know how? Or is it better to convert to a dict? – Maxene Graze Feb 15 '21 at 19:38
  • Ah, that is important new information :) You can make a list of dicts with `df.to_dict('records')`, but it is easier to use the dataframe utilities. Check [this answer](https://stackoverflow.com/a/32801170/13420860), which is more appropriate. – duthils Feb 15 '21 at 22:03
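
Following up on the data frame suggestion in the comments, here is a minimal sketch of the same grouping done directly in pandas. It is not taken from the linked answer. The sample frame is built by hand from the rows in the question, and the names `wide` and `records` are only illustrative.

```python
import pandas as pd

# Sample rows copied from the question (only the columns used here).
df = pd.DataFrame([
    {"script": "Georgian", "script_family": "European", "date": 500},
    {"script": "Armenian", "script_family": "European", "date": 500},
    {"script": "Tamil", "script_family": "Indic", "date": 800},
    {"script": "Tibetan", "script_family": "Central Asian", "date": 800},
    {"script": "Khmer", "script_family": "Mainland Southeast Asian", "date": 900},
])

# Count rows per (date, family), then pivot the families into columns.
# Families that do not occur at a date are filled with 0.
wide = (
    df.groupby(["date", "script_family"])
      .size()
      .unstack(fill_value=0)
)

# One dict per date, dropping families whose count is zero.
records = [
    {"date": int(date), **{fam: int(n) for fam, n in row.items() if n > 0}}
    for date, row in wide.iterrows()
]
print(records)
```

If a CSV with one row per date and one column per script family is enough, `wide.to_csv(...)` can be called on the pivoted frame directly, skipping the `records` step.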