1

I have a data frame and in some of the rows, for one of the columns, I have a 1D array. for example: fig1

how I can count the number of values in the arrays in column data (separated by comma) for each row and show the number of them in a new column of new data frame same as fig 2:

fig2

SeaBean
  • 22,547
  • 3
  • 13
  • 25
WhoIsKi
  • 117
  • 1
  • 14
  • Welcome to Stack Overflow! Please include any relevant information [as text directly into your question](https://stackoverflow.com/editing-help), do not link or embed external images of source code or data. Images make it difficult to efficiently assist you as they cannot be copied and offer poor usability as they cannot be searched. See: [Why not upload images of code/errors when asking a question?](https://meta.stackoverflow.com/q/285551/15497888) – Henry Ecker Jun 13 '21 at 16:00
  • If you need assistance formatting a small sample of your DataFrame as a copyable piece of code for SO see [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/15497888). – Henry Ecker Jun 13 '21 at 16:00
  • Also why is John `7` and not `6`? – Henry Ecker Jun 13 '21 at 16:01
  • @HenryEcker thanks! You're right! 6 is true. – WhoIsKi Jun 13 '21 at 16:04

1 Answers1

1

You can use .str.len() to get the item count in lists in column data and then use .groupby() to aggregate the count of same name using .sum(), as follows:

df_out = (df['data'].str.len()
                    .groupby(df['name'], sort=False).sum()
         ).to_frame(name='data_count').reset_index()

Result:

print(df_out)


     name  data_count
0    john           6
1  amanda           0
2    sara           5

Edit

If the column data consists of strings looking like arrays/lists, instead of the 1D array as mentioned in the question, you can run the following code to convert the column into real arrays/lists first:

df['data'] = df['data'].str.strip('[]').str.replace("'", "").str.replace('"', '').replace('', np.nan).str.split(',').fillna({i: [] for i in df.index})

Test Run

Test Data Setup

nan = np.nan
# dict of dataframe dump by df.to_dict() as provided by OP in the comment:
data = {'name': {0: 'john', 1: 'amanda', 2: 'sara', 3: 'john'}, 'data': {0: '[a4G, bweQ, fp_dE4]', 1: nan, 2: '[H2dw45, IfC4, bAf23g, Lkfr54-op, a3dLa]', 3: '[Tr45b, kM30, riU91]'}}
df = pd.DataFrame(data)

df['data'] = df['data'].str.strip('[]').str.replace("'", "").str.replace('"', '').replace('', np.nan).str.split(',').fillna({i: [] for i in df.index})

Run solution codes

df_out = (df['data'].str.len()
                    .groupby(df['name'], sort=False).sum()
         ).to_frame(name='data_count').reset_index()

Result:

print(df_out)


     name  data_count
0    john           6
1  amanda           0
2    sara           5
SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • thank you but for my original data, the result of this code is the number of characters. what's your idea? – WhoIsKi Jun 13 '21 at 16:33
  • when I read the sample .xlsx file as a pandas data frame, the result is the number of characters. john : 51.0 , amanda : 0.0 , sara : 50.0 – WhoIsKi Jun 13 '21 at 16:52
  • @Kla Probably your so-called 1-D array is not a real array/list, but string with square brackets at the ends with other substrings inside the long strings. Could you show me the contents of the dataframe by using e.g. `df.to_dict()` ? – SeaBean Jun 13 '21 at 17:50
  • yes, I think you're right. the contents of that column are some strings that separated by a comma. each set of strings is in square brackets. – WhoIsKi Jun 13 '21 at 18:07
  • could you create an excel file with these example data? {'name': {0: 'john', 1: 'amanda', 2: 'sara', 3: 'john'}, 'data': {0: '[a4G, bweQ, fp_dE4]', 1: nan, 2: '[H2dw45, IfC4, bAf23g, Lkfr54-op, a3dLa]', 3: '[Tr45b, kM30, riU91]'}} – WhoIsKi Jun 13 '21 at 18:11
  • @Kia In this case, would you please see my edit and run the code to convert your strings into real 1D array first. Then you will get the result you want. – SeaBean Jun 13 '21 at 18:17
  • How to count without duplicate data ( in column data) ? for example, if I have another a4G ( [Tr45b, kM30, riU91] -> [Tr45b, kM30, riU91, a4G] ) data_count for john isn't 7; it's 6. – WhoIsKi Jun 16 '21 at 14:03
  • @Kia It's a separate question to get unique counts. Surely, there are quite a lot of similar questions on Stackoverflow. I have answered such question too. You can search it using keywords eg. Unique item count pandas, to look for it. – SeaBean Jun 16 '21 at 14:53