
I have a data frame (df) with a string column called supported_cpu. Each value in supported_cpu is a string of CPU names separated by commas. I want to use this data for an ML model.


I first extracted the unique values of the supported_cpu column. The output is a list of unique values.

def pars_string(df, col):
    # Split each distinct string in the column on commas.
    # (value_counts().reset_index() no longer yields an 'index' column in pandas >= 2.0,
    # so the distinct strings are taken directly from the column.)
    data = df[col].dropna().drop_duplicates().str.split(",")
    # Collect every item from every split list into one flat list
    items = []
    for parts in data:
        for item in parts:
            items.append(item)
    # Strip leading/trailing spaces so that ' x' and 'x' count as the same value
    stripped = [x.strip(' ') for x in items]
    # Keep only the unique values
    return list(set(stripped))

supported_cpu_list = pars_string(df=df,col='supported_cpu')

The output (originally shown as an image) is the list of unique values.

I want to map this output back to the data frame so that I can encode it for the ML model.

How can I store the output in the data frame? Note: some rows have multiple values (more than one CPU).

Input: a string column with comma-separated values. Output: I do not know what form it should take.


D_D

2 Answers


I really recommend that anyone who is starting out with pandas read about vectorization and think in terms of columns (aka Series). This is the way it was built, and it is the way it is supposed to be used.
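As a small illustration of the column-wise style, here is a minimal sketch on toy data. The CPU names and the one-column frame are made up for the example; only the column name supported_cpu comes from the question.

```python
import pandas as pd

# Toy data (hypothetical CPU names), with inconsistent spacing around commas
df = pd.DataFrame({"supported_cpu": ["x86_64, arm64", "arm64", "x86_64,ppc64le"]})

unique_cpus = (
    df["supported_cpu"]
    .str.split(",")   # each string becomes a list of items
    .explode()        # one item per row
    .str.strip()      # remove leading/trailing spaces
    .unique()         # distinct values, in order of first appearance
    .tolist()
)
print(unique_cpus)
```

Every step operates on the whole column at once, so no explicit Python loop over rows is needed.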

From what I understand (I may be wrong), you want to get the unique values from the supported_cpu column. You could use the Series string methods to split that column, then flatten the resulting lists using `itertools.chain`:

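from itertools import chain

# Turn each comma-separated string into a list of items
df['supported_cpu'] = df['supported_cpu'].str.split(pat=',')

# Flatten all lists, then strip spaces and drop empty items
unique_vals = set(chain.from_iterable(df['supported_cpu'].tolist()))
unique_vals = {item.strip() for item in unique_vals if item.strip()}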
Hrimiuc Paul
  • I have a unique value list; I want to map the unique value list to the data frame again instead of a multi-value string column. – D_D Feb 16 '23 at 21:07
  • Then you only need the first line, which splits the column for every row: `df['supported_cpu'] = df['supported_cpu'].str.split(pat=',')` – Hrimiuc Paul Feb 17 '23 at 13:46
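If the goal is to attach the unique values back to the original frame as model features, one possible approach (an assumption about what is wanted, not something confirmed in the thread) is a multi-hot encoding: one 0/1 column per CPU. A minimal sketch with `Series.str.get_dummies`, using made-up CPU names, a hypothetical `device` column, and a hypothetical `cpu_` prefix:

```python
import pandas as pd

# Toy data (hypothetical values); only the column name supported_cpu is from the question
df = pd.DataFrame({
    "device": ["a", "b", "c"],
    "supported_cpu": ["x86_64, arm64", "arm64", "x86_64,ppc64le "],
})

# Normalize spacing so ' arm64' and 'arm64' end up in the same column
cleaned = (
    df["supported_cpu"]
    .str.replace(r"\s*,\s*", ",", regex=True)
    .str.strip()
)

# One indicator column per distinct CPU, one row per original row
dummies = cleaned.str.get_dummies(sep=",")

# Attach the indicator columns to the original frame
encoded = df.join(dummies.add_prefix("cpu_"))
print(encoded)
```

Rows with several CPUs simply get a 1 in several columns, so the multi-value rows stay as single rows in the frame.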

Rows with multiple values should be split into single values before ML model training. The list can be converted to a data frame simply with pd.DataFrame(supported_cpu_list).
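A minimal sketch of that conversion, assuming supported_cpu_list is a plain Python list of strings (the values below are made up, and the column name is just a label chosen for the example):

```python
import pandas as pd

# Hypothetical stand-in for the list returned by pars_string
supported_cpu_list = ["x86_64", "arm64", "ppc64le"]

# One row per unique CPU, with an explicit column name
cpu_df = pd.DataFrame(supported_cpu_list, columns=["supported_cpu"])
print(cpu_df)
```

Note that this produces a stand-alone frame of unique values; it is not aligned row by row with the original df.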

sicilyxw
  • I need to connect it with the original data frame, not a stand-alone data frame. – D_D Feb 16 '23 at 21:08
  • OK, I may have misunderstood your problem. How about trying this: first, split the multiple values in the original data frame as @Hrimiuc Paul suggested, with `df['supported_cpu'] = df['supported_cpu'].str.split(pat=',')`. Then spread the supported_cpu values over multiple rows with `df = df.explode('supported_cpu')`. Then drop the duplicated CPUs with `df.drop_duplicates(subset="supported_cpu")`... – sicilyxw Feb 18 '23 at 11:06
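The steps from the comment above can be sketched on toy data as follows. The CPU names and the `device` column are made up, and the extra strip step is an addition to handle spaces after commas:

```python
import pandas as pd

# Toy data (hypothetical values); only the column name supported_cpu is from the question
df = pd.DataFrame({
    "device": ["a", "b", "c"],
    "supported_cpu": ["x86_64, arm64", "arm64", "x86_64,ppc64le"],
})

# Step 1: split each string into a list of items
df["supported_cpu"] = df["supported_cpu"].str.split(pat=",")

# Step 2: one row per (device, CPU) pair; the original index is repeated
long_df = df.explode("supported_cpu")
long_df["supported_cpu"] = long_df["supported_cpu"].str.strip()

# Step 3: keep only the first row seen for each CPU
unique_rows = long_df.drop_duplicates(subset="supported_cpu")
print(long_df)
print(unique_rows)
```

Be aware that step 3 keeps one row per CPU and discards the other devices that support it, so long_df (before dropping duplicates) is probably the frame to keep if every row of the original data is needed for training.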