Starting from this dataframe df:
node1,node2,lang,w,c1,c2
1,2,it,1,a,a
1,2,en,1,a,a
2,3,es,2,a,b
3,4,it,1,b,b
5,6,it,1,c,c
3,5,tg,1,b,c
1,7,it,1,a,a
7,1,es,1,a,a
3,8,es,1,b,b
8,4,es,1,b,b
1,9,it,1,a,a
I performed a groupby operation like:
g = df.groupby(['c1','c2'])['lang'].unique().reset_index()
which results in:
c1 c2 lang
0 a a [it, en, es]
1 a b [es]
2 b b [it, es]
3 b c [tg]
4 c c [it]
I then saved it to .csv and read it back:
g.to_csv('myfile.csv')
g = pd.read_csv('myfile.csv')
and obtained a different format for the lang column:
c1 c2 lang
0 a a ['it' 'en' 'es']
1 a b ['es']
2 b b ['it' 'es']
3 b c ['tg']
4 c c ['it']
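For reference, the round trip can be reproduced in a few lines. This is only a minimal sketch: it uses a reduced version of the dataframe and an in-memory buffer in place of myfile.csv, and it just compares the type of one cell before and after the write/read.

```python
import io
import pandas as pd

# Reduced frame with the same column names as above (only the columns needed here).
df = pd.DataFrame({
    "c1": ["a", "a", "a", "b"],
    "c2": ["a", "a", "b", "b"],
    "lang": ["it", "en", "es", "it"],
})

g = df.groupby(["c1", "c2"])["lang"].unique().reset_index()
before = type(g["lang"].iloc[0])   # an array type straight after the groupby

# Round trip through CSV text; the in-memory buffer stands in for 'myfile.csv'.
buf = io.StringIO()
g.to_csv(buf)
buf.seek(0)
g2 = pd.read_csv(buf)
after = type(g2["lang"].iloc[0])   # a plain str after reading back

print(before.__name__, after.__name__)
```

CSV is a text format, so each array is written out as its string representation and comes back as ordinary text.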
My goal now is to count the number of items in each row of lang, and to be able to access those values individually. I tried to build a new column with the length of each array of strings:
g['len'] = g['lang'].apply(lambda x: x.size)
obtaining:
AttributeError: 'str' object has no attribute 'size'
Looking at the values of the lang column, I realized that after being written to and read back from .csv, each entry in that column became a single string:
In [113]: g['lang'].values
Out[113]: array(["['it' 'en' 'es']", "['es']", "['it' 'es']", "['tg']", "['it']"], dtype=object)
How can I obtain the length of each nested string array and then get the values of each string within it? I thought about this type of conversion, but my case is a little too complicated.
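To make the target concrete, here is a rough sketch of the kind of result I am after, starting from the strings shown in Out[113]. The helper name parse_langs is hypothetical, and the regular expression assumes that every item is wrapped in single quotes and contains no quote character itself; I am not sure this is the most robust approach.

```python
import re
import pandas as pd

# The strings exactly as shown in Out[113] above.
g = pd.DataFrame({
    "c1": ["a", "a", "b", "b", "c"],
    "c2": ["a", "b", "b", "c", "c"],
    "lang": ["['it' 'en' 'es']", "['es']", "['it' 'es']", "['tg']", "['it']"],
})

def parse_langs(text):
    """Hypothetical helper: pull every single-quoted token out of the string."""
    return re.findall(r"'([^']*)'", text)

g["lang"] = g["lang"].apply(parse_langs)   # each cell is now a Python list
g["len"] = g["lang"].apply(len)            # number of items per row

print(g["len"].tolist())       # [3, 1, 2, 1, 1]
print(g.loc[0, "lang"][1])     # en
```

With the cells converted to lists, the length and the individual values are both available by ordinary indexing.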
EDIT: added information about the different format of the lang column before and after writing/reading to/from .csv.