Starting from this dataframe df:

node1,node2,lang,w,c1,c2
1,2,it,1,a,a
1,2,en,1,a,a
2,3,es,2,a,b
3,4,it,1,b,b
5,6,it,1,c,c
3,5,tg,1,b,c
1,7,it,1,a,a
7,1,es,1,a,a
3,8,es,1,b,b
8,4,es,1,b,b
1,9,it,1,a,a

I performed a groupby operation like:

g = df.groupby(['c1','c2'])['lang'].unique().reset_index()

which results in:

  c1 c2          lang
0  a  a  [it, en, es]
1  a  b          [es]
2  b  b      [it, es]
3  b  c          [tg]
4  c  c          [it]

I then saved it to .csv and read it back:

g.to_csv('myfile.csv')
g = pd.read_csv('myfile.csv')

which gives a different format of the lang column:

  c1 c2              lang
0  a  a  ['it' 'en' 'es']
1  a  b            ['es']
2  b  b       ['it' 'es']
3  b  c            ['tg']
4  c  c            ['it']

My goal now is to count the number of items in each row of lang, and to be able to get those values individually. I tried to build a new column with the length of each array of strings:

g['len'] = g['lang'].apply(lambda x: x.size)

but this raises:

AttributeError: 'str' object has no attribute 'size'

Looking at the values of the lang column, I realized that each entry is now a single string rather than an array:

In [113]: g['lang'].values
Out[113]: array(["['it' 'en' 'es']", "['es']", "['it' 'es']", "['tg']", "['it']"], dtype=object)

How can I obtain the length of each nested string array and then get the values of each string within it? I thought about this type of conversion, but my case is a little too complicated.

EDIT: added information about the different format of the lang column before and after writing to and reading from .csv.

Fabio Lamanna

1 Answer

Just apply len:

In [145]:
g['size'] = g['lang'].apply(len)
g

Out[145]:
  c1 c2          lang  size
0  a  a  [it, en, es]     3
1  a  b          [es]     1
2  b  b      [it, es]     2
3  b  c          [tg]     1
4  c  c          [it]     1
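
Note that this relies on lang still holding arrays. Once the frame has been written to CSV and read back, each entry is a string such as "['it' 'en' 'es']", and len returns the length of that string instead. A minimal sketch of one way to recover the items, assuming the strings keep exactly that single-quoted format (parse_langs is a hypothetical helper name, not a pandas function):

```python
import re

import pandas as pd

# Stand-in for the frame as it looks after the CSV round trip:
# every lang value is one string, not an array.
g = pd.DataFrame({
    "c1": ["a", "a", "b", "b", "c"],
    "c2": ["a", "b", "b", "c", "c"],
    "lang": ["['it' 'en' 'es']", "['es']", "['it' 'es']", "['tg']", "['it']"],
})


def parse_langs(text):
    """Hypothetical helper: return every single-quoted token as a list."""
    return re.findall(r"'([^']*)'", text)


# Turn each string back into a real list, then count its items.
g["lang"] = g["lang"].apply(parse_langs)
g["size"] = g["lang"].apply(len)

print(g)
```

After this, individual values can be read with ordinary list indexing, for example g.loc[0, 'lang'][1].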
EdChum
  • Thanks! Do you know why writing to csv after the groupby and reading the file back gives me a different format of the lang column? So I can apply your method before saving to file, but not after reading it back? – Fabio Lamanna Mar 02 '16 at 11:41
  • By default the index will be written out; my guess is that you may be reading it back in again, which adds a new column – EdChum Mar 02 '16 at 11:45
  • It doesn't work on my PC after csv read/write: `gar['lang'].apply(len)` returns `[16, 6, 11, 6, 6]`, the lengths of the strings. IMHO, using pickle instead of csv read/write is the good solution here; or `g=pd.read_csv('myfile.csv',converters={'lang': a_very_tricky_function})`. – B. M. Mar 02 '16 at 12:22
  • @B.M. I encountered the same issue when writing/reading back. – Fabio Lamanna Mar 02 '16 at 12:31
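
The two suggestions in the comments above (a converter at read time, or pickle instead of CSV) could be sketched as follows. This is only an illustration under stated assumptions: the frame is a hand-built stand-in for the groupby result, parse_langs is a hypothetical converter, an in-memory buffer replaces 'myfile.csv', and the pickle file name is made up. Passing index=False to to_csv also avoids the extra index column mentioned above.

```python
import io
import os
import re
import tempfile

import numpy as np
import pandas as pd

# Stand-in for the groupby result: lang holds numpy arrays of strings.
g = pd.DataFrame({
    "c1": ["a", "a", "b", "b", "c"],
    "c2": ["a", "b", "b", "c", "c"],
    "lang": [
        np.array(["it", "en", "es"], dtype=object),
        np.array(["es"], dtype=object),
        np.array(["it", "es"], dtype=object),
        np.array(["tg"], dtype=object),
        np.array(["it"], dtype=object),
    ],
})


def parse_langs(text):
    """Hypothetical converter: extract the single-quoted tokens."""
    return re.findall(r"'([^']*)'", text)


# Option 1: CSV with a converter applied to the lang column while reading.
csv_text = g.to_csv(index=False)  # index=False: no extra index column
g_csv = pd.read_csv(io.StringIO(csv_text), converters={"lang": parse_langs})
g_csv["size"] = g_csv["lang"].apply(len)

# Option 2: pickle keeps the Python objects, so nothing needs parsing.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "g.pkl")  # hypothetical file name
    g.to_pickle(path)
    g_pkl = pd.read_pickle(path)
g_pkl["size"] = g_pkl["lang"].apply(len)

print(g_csv)
print(g_pkl)
```

The converter route keeps the file human-readable but depends on the exact string format that to_csv happened to produce; pickle avoids parsing entirely, at the cost of a binary, Python-specific file.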