2

I have written the following code:

df = pd.read_csv('breast-cancer-wisconsin.data.csv', nrows= 300)
columns_df = pd.DataFrame(columns = ['column_name', 'value'])
columns = df.columns.values
for index, row in df.iterrows():
    for column in columns:
        columns_df = columns_df.append({'column_name': column, 'value': row[column]}, ignore_index=True)

This script reads a csv file to a pandas dataframe and then appends each column name and it's corresponding value to a new dataframe. If I run

print(columns_df[-10:])

I get the following output

                 column_name   value
2300         clump_thickness       5
2301          unif_cell_size       5
2302         unif_cell_shape       7
2303           marg_adhesion       8
2304  single_epith_cell_size       6
2305             bare_nuclei      10
2306             bland_chrom       7
2307           norm_nucleoli       4
2308                 mitoses       1
2309                   class       4
2310         clump_thickness       5
2311          unif_cell_size       3
2312         unif_cell_shape       4
2313           marg_adhesion       3
2314  single_epith_cell_size       4
2315             bare_nuclei       5
2316             bland_chrom       4
2317           norm_nucleoli       7
2318                 mitoses       1
2319                   class       2
2320         clump_thickness       5
2321          unif_cell_size       4
2322         unif_cell_shape       3
2323           marg_adhesion       1
2324  single_epith_cell_size       2
2325             bare_nuclei  -99999
2326             bland_chrom       2
2327           norm_nucleoli       3
2328                 mitoses       1
2329                   class       2
2330         clump_thickness       8
2331          unif_cell_size       2
2332         unif_cell_shape       1
2333           marg_adhesion       1
2334  single_epith_cell_size       5
2335             bare_nuclei       1
2336             bland_chrom       1
2337           norm_nucleoli       1
2338                 mitoses       1
2339                   class       2
2340         clump_thickness       9
2341          unif_cell_size       1
2342         unif_cell_shape       2
2343           marg_adhesion       6
2344  single_epith_cell_size       4
2345             bare_nuclei      10
2346             bland_chrom       7
2347           norm_nucleoli       7
2348                 mitoses       2
2349                   class       4

Unfortunately, this script is not very fast and will take a very long time to execute with large dataframes.

Question: Is there a more elegant/faster way to achieve these results?

A sample from my input data (the csv file):

id,clump_thickness,unif_cell_size,unif_cell_shape,
marg_adhesion,single_epith_cell_size,
bare_nuclei,bland_chrom,norm_nucleoli,mitoses,class  
    1000025,5,1,1,1,2,1,3,1,1,2
    1002945,5,4,4,5,7,10,3,2,1,2
    1015425,3,1,1,1,2,2,3,1,1,2
    1016277,6,8,8,1,3,4,3,7,1,2
    1017023,4,1,1,3,2,1,3,1,1,2
    1017122,8,10,10,8,7,10,9,7,1,4
    1018099,1,1,1,1,2,10,3,1,1,2
    1018561,2,1,2,1,2,1,3,1,1,2
    1033078,2,1,1,1,2,1,1,1,5,2
    1033078,4,2,1,1,2,1,2,1,1,2
    1035283,1,1,1,1,1,1,3,1,1,2
    1036172,2,1,1,1,2,1,2,1,1,2
    1041801,5,3,3,3,2,3,4,4,1,4
    1043999,1,1,1,1,2,3,3,1,1,2
    1044572,8,7,5,10,7,9,5,5,4,4
    1047630,7,4,6,4,6,1,4,3,1,4
    1048672,4,1,1,1,2,1,2,1,1,2
    1049815,4,1,1,1,2,1,3,1,1,2
    1050670,10,7,7,6,4,10,4,1,2,4
    1050718,6,1,1,1,2,1,3,1,1,2
    1054590,7,3,2,10,5,10,5,4,4,4
    1054593,10,5,5,3,6,7,7,10,1,4
    1056784,3,1,1,1,2,1,2,1,1,2
    1057013,8,4,5,1,2,?,7,3,1,4
    1059552,1,1,1,1,2,1,3,1,1,2
    1065726,5,2,3,4,2,7,3,6,1,4
    1066373,3,2,1,1,1,1,2,1,1,2
    1066979,5,1,1,1,2,1,2,1,1,2
    1067444,2,1,1,1,2,1,2,1,1,2
    1070935,1,1,3,1,2,1,1,1,1,2
    1070935,3,1,1,1,1,1,2,1,1,2
    1071760,2,1,1,1,2,1,3,1,1,2
    1072179,10,7,7,3,8,5,7,4,3,4
    1074610,2,1,1,2,2,1,3,1,1,2
    1075123,3,1,2,1,2,1,2,1,1,2
Mr. President
  • 1,489
  • 3
  • 11
  • 21

1 Answers1

1

If performance is important use melt, but order of values is changed:

df = df.melt('id')
print (df.head())
        id         variable value
0  1000025  clump_thickness     5
1  1002945  clump_thickness     5
2  1015425  clump_thickness     3
3  1016277  clump_thickness     6
4  1017023  clump_thickness     4

Or use:

df = df.set_index('id').stack().rename_axis(['id','var']).reset_index(name='val')
print (df.head())
        id                     var val
0  1000025         clump_thickness   5
1  1000025          unif_cell_size   1
2  1000025         unif_cell_shape   1
3  1000025           marg_adhesion   1
4  1000025  single_epith_cell_size   2
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252