2

I have a CSV file that I read using pandas, and I want to split the DataFrame into chunks based on the values in a specified column:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

list_of_classes=[]
# Reading file
fileName = 'Training.csv'
df       = pd.read_csv(fileName)
classID  = df.iloc[:,-2]
len(classID)
df.iloc[0,-2]
for i in range(len(classID)):
    print(classID[i])
    if classID[i] not in list_of_classes:
        list_of_classes.append(classID[i])


for i in range(len(df)):
    ...  # per-class splitting logic goes here

UPDATE

Say the DataFrame looks like this:

Feature0  Feature1  Feature2  Feature3  .........  classID  lastColum
     190       565     35474  0.336283   2.973684      255          0
     311       984    113199  0.316057   3.163987      155          0
     310       984     94197  0.315041   3.174194     1005          0
     280       984    116359  0.284553   3.514286      255         18
     249       984    107482  0.253049   3.951807     1005          0
     283       984    132343  0.287602   3.477032      155          0
     213       984     88244  0.216463   4.619718      255          0
     839       984    203139  0.852642   1.172825      255          0
     376       984    105133  0.382114   2.617021     1005          0
     324       984    129209  0.329268   3.037037     1005          0

In this example, the result I'm aiming for is three DataFrames, each containing only one classID: 155, 1005, or 255. Is there a cleaner way to do this?

Engine

2 Answers

2

Split into three separate CSV files, one per classID:

df.groupby('classID') \
  .apply(lambda x: x.to_csv(r'c:/temp/{}.csv'.format(x.name), index=False))
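The same split can also be written as a plain loop over the groups, which avoids calling apply purely for its side effect. This is a minimal sketch, assuming a small stand-in DataFrame built from a few of the question's sample rows and a temporary output directory; out_dir and written are illustrative names that do not appear in the answer.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the question's data (values copied from the sample rows).
df = pd.DataFrame({
    "Feature0": [190, 311, 310, 280],
    "classID": [255, 155, 1005, 255],
    "lastColum": [0, 0, 0, 18],
})

# Hypothetical output directory; the answer above writes to c:/temp instead.
out_dir = tempfile.mkdtemp()

written = []
for class_id, group in df.groupby("classID"):
    # Each group holds only the rows for one classID.
    path = os.path.join(out_dir, "{}.csv".format(class_id))
    group.to_csv(path, index=False)
    written.append(path)
```

Because the grouping column is an ordinary column of each group here, it is written to every file alongside the feature columns.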

Generate a dictionary of split DataFrames, keyed by classID:

In [210]: dfs = {g:x for g,x in df.groupby('classID')}

In [211]: dfs.keys()
Out[211]: dict_keys([155, 255, 1005])

In [212]: dfs[155]
Out[212]:
   Feature0  Feature1  Feature2  Feature3  classID  lastColum
1       311       984    113199  0.316057      155          0
5       283       984    132343  0.287602      155          0

In [213]: dfs[255]
Out[213]:
   Feature0  Feature1  Feature2  Feature3  classID  lastColum
0       190       565     35474  0.336283      255          0
3       280       984    116359  0.284553      255         18
6       213       984     88244  0.216463      255          0
7       839       984    203139  0.852642      255          0

In [214]: dfs[1005]
Out[214]:
   Feature0  Feature1  Feature2  Feature3  classID  lastColum
2       310       984     94197  0.315041     1005          0
4       249       984    107482  0.253049     1005          0
8       376       984    105133  0.382114     1005          0
9       324       984    129209  0.329268     1005          0
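As a runnable variant of the session above, the sketch below builds the dictionary with dict(tuple(...)) instead of a comprehension and also pulls out one group with get_group. It assumes a reduced stand-in DataFrame (only two of the question's columns); the names grouped and only_155 are illustrative.

```python
import pandas as pd

# Reduced stand-in for the question's data.
df = pd.DataFrame({
    "Feature0": [190, 311, 310, 280, 249, 283],
    "classID": [255, 155, 1005, 255, 1005, 155],
})

grouped = df.groupby("classID")

# Iterating a GroupBy yields (key, sub-frame) pairs, so the pairs can be
# collected into a tuple and handed straight to dict().
dfs = dict(tuple(grouped))

# A single group can also be fetched without building the whole dictionary.
only_155 = grouped.get_group(155)
```

Each sub-frame keeps the row labels of the original DataFrame, which makes it easy to trace rows back to the source.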
MaxU - stand with Ukraine
0

Here is an example of how you can do it:

import pandas as pd

df = pd.DataFrame({'A': list('abcdef'), 'part': [1, 1, 1, 2, 2, 2]})

parts = df.part.unique()

for part in parts:
    print(df.loc[df.part == part])

The point is that you get all the unique values by calling unique() on the Series you want to split on.

After that, you can loop over those values, select the matching rows for each one, and do whatever you need with each subset.
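Applied to the question's classID column, this approach might look like the sketch below. It assumes a reduced stand-in DataFrame and stores each subset in a dictionary named subsets (an illustrative name) rather than printing it.

```python
import pandas as pd

# Reduced stand-in for the question's data.
df = pd.DataFrame({
    "Feature0": [190, 311, 310, 280],
    "classID": [255, 155, 1005, 255],
})

subsets = {}
# unique() returns the distinct values in order of first appearance.
for class_id in df["classID"].unique():
    # The boolean mask keeps only the rows whose classID matches.
    subsets[class_id] = df.loc[df["classID"] == class_id]
```

Each pass through the loop compares the whole column against one value, so with many distinct classes a groupby-based split is likely to do less work.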

zipa