1

I have a DataFrame in the format below. The description column holds strings.

file description
x [[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]
y [[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'], dtype=object), array([0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457])]]

How can I convert the DataFrame into the format below?

file license score
x ['MIT', 'MIT', 'MIT', 'MIT', 'MIT'] [0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217]
y ['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'] [0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457]

The above is just an example; the real DataFrame is very large.

I'mahdi
sudojarvis

2 Answers

1

Update: if the elements in the column are strings, you can extract the arrays with a regular expression. (Note: don't use eval; see "Why should exec() and eval() be avoided?")

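import ast
import pandas as pd

# Parse the two matched list literals into real Python lists.
new_cols = lambda x: pd.Series({'licence':ast.literal_eval(x[0]), 
                                'score':ast.literal_eval(x[1])})

# 'm' is the column holding the string (called 'description' in the question).
df = df.join(df['m'].str.findall(r'\[[^\[\]]*\]').apply(new_cols)).drop('m', axis=1)
print(df)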

Output:

    file    licence                                              score
0   x       ['MIT', 'MIT', 'MIT', 'MIT', 'MIT']                     [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1   y       ['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', ...    [0.28552457, 0.28552457, 0.28552457, 0.2855245...

How the regex finds the arrays: it matches every substring that starts with [ and ends with ] and contains no other [ or ] in between, so only the innermost lists are returned.

>>> import re
>>> re.findall(r'\[[^\[\]]*\]', "[[np.array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), np.array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]")
["['MIT', 'MIT', 'MIT', 'MIT', 'MIT']",
 '[0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217]']
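To make the next step concrete, here is a minimal sketch (standard library only) of how each matched substring becomes a real Python list with ast.literal_eval, and why that is safer than eval: literal_eval accepts only literals and raises on anything else. The variable names are illustrative, not part of the original code.

```python
import ast
import re

text = ("[[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), "
        "array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]")

# Same pattern as above: innermost [...] groups only.
matches = re.findall(r"\[[^\[\]]*\]", text)

# Each match is a plain list literal, so literal_eval can parse it.
licence, score = (ast.literal_eval(m) for m in matches)
print(licence)
print(score)

# Anything that is not a literal (here, a function call) is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

With eval, the last string would have been executed instead of rejected, which is the risk the note above refers to.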

Old answer: if the column holds actual nested lists rather than strings, you can create the new columns and then join them with the original dataframe.

new_cols = lambda x: pd.Series({'licence':x[0][0], 'score':x[0][1]})
df = df.join(df['m'].apply(new_cols)).drop('m', axis=1)
print(df)
I'mahdi
  • Sorry, I forgot to mention that the description is in string format. – sudojarvis Jul 10 '22 at 03:25
  • There are (although few) cases where `eval` may be acceptable, I'd argue that this could be one of those cases, as long as it's clearly known and controlled by OP where the data is coming from~ – BeRT2me Jul 10 '22 at 04:11
  • @BeRT2me, with your code, if someone writes a string in this column that removes all files from the OS, your code will remove all files from the OS, because you eval that string. – I'mahdi Jul 10 '22 at 04:12
  • @I'mahdi and? If I control the data, it doesn't matter. That's only if there is an outside attack vector possible. – BeRT2me Jul 10 '22 at 04:14
  • @I'mahdi thanks for your help and time. Is there any way I can speed this up for data with 5 million entries? – sudojarvis Jul 10 '22 at 05:06
  • @sudojarvis, did you try this? Is it slow? Is that 5 million rows, or 5 million arrays in each row? – I'mahdi Jul 10 '22 at 05:07
  • @I'mahdi yup, I checked it. It's 5 million rows. – sudojarvis Jul 10 '22 at 05:11
  • @sudojarvis, OK, is it slow? Does it not work for 5 million? – I'mahdi Jul 10 '22 at 05:13
  • For small data it's working fine, but for 5 million it is still processing. – sudojarvis Jul 10 '22 at 05:15
  • @sudojarvis, I recommend you ask a new question about how to find a pattern in strings across 5 million rows in pandas, and tag it with [tag:performance] and [tag:optimization]. I know there are users here who can help you. – I'mahdi Jul 10 '22 at 05:33
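On the speed question in the comments: one thing that may help (an untested assumption, not a benchmark) is to avoid building a pd.Series for every row and instead parse everything in a plain loop with a precompiled pattern. The sketch below uses only the standard library; parse_descriptions and INNER_LIST are hypothetical names, and the sample strings are shortened versions of the rows in the question.

```python
import ast
import re

# Same innermost-list pattern as in the answer above, compiled once.
INNER_LIST = re.compile(r"\[[^\[\]]*\]")

def parse_descriptions(descriptions):
    """Return (licenses, scores) as two lists of Python lists.

    Assumes every string contains at least two innermost lists;
    a string with fewer raises ValueError on unpacking.
    """
    licenses, scores = [], []
    for text in descriptions:
        lic, score = INNER_LIST.findall(text)[:2]
        licenses.append(ast.literal_eval(lic))
        scores.append(ast.literal_eval(score))
    return licenses, scores

# Shortened sample rows, for illustration only.
sample = [
    "[[array(['MIT', 'MIT'], dtype=object), array([0.71641791, 0.69565217])]]",
    "[[array(['APSL-1.0', 'APSL-1.0'], dtype=object), array([0.28552457, 0.28552457])]]",
]
licenses, scores = parse_descriptions(sample)
print(licenses)
print(scores)
```

The two results could then be attached to the frame with something like `df['license'] = pd.Series(licenses, index=df.index)`, and likewise for the scores. Whether this is fast enough for 5 million rows would need to be measured on the real data.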
0

Input:

  file                                        description
0    x  [[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], d...
1    y  [[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', '...

Doing:

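import ast

# Strip the numpy repr wrappers so only nested list literals remain, then parse safely.
df.description = (df.description.str.replace('array', '', regex=False)
                    .str.replace(', dtype=object', '', regex=False)
                    .apply(ast.literal_eval))
# Each parsed value is [[license_list, score_list]]; .str[0] takes the inner pair.
df[['license', 'score']] = [(x[0], x[1]) for x in df.description.str[0]]
df = df.drop('description', axis=1)
print(df)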

Output:

  file                                            license                                              score
0    x                          [MIT, MIT, MIT, MIT, MIT]  [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1    y  [APSL-1.0, APSL-1.0, APSL-1.0, APSL-1.0, APSL-...  [0.28552457, 0.28552457, 0.28552457, 0.2855245...
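To see why the two replacements are enough, here is a small sketch of the same idea on a single string, using only the standard library. After removing `array` and `, dtype=object`, each list is left wrapped in parentheses, which ast.literal_eval treats as plain grouping. The variable names are illustrative.

```python
import ast

text = ("[[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'], dtype=object), "
        "array([0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457])]]")

# Remove the numpy repr wrappers; the leftover parentheses are harmless.
cleaned = text.replace("array", "").replace(", dtype=object", "")
print(cleaned)

# The result is [[license_list, score_list]], so index 0 gives the pair.
parsed = ast.literal_eval(cleaned)
license_list, score_list = parsed[0]
print(license_list)
print(score_list)
```

One caveat, stated as an assumption about the data: a plain replace would also alter any license name that happens to contain the text `array`, so this approach relies on that never occurring.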
BeRT2me