
I need to read this CSV file with pandas, perform some processing on it, and write 10% of the data to another file.

Given this solution (https://stackoverflow.com/a/55763598/3373710), I want to process the rest of store_data after taking out the 10% of rows. However, the elif condition prints the same rows as the original file. How can I fix my condition so that it skips the 10% of rows?

store_data = pd.read_csv("heart_disease.csv")

with open("out1.csv","w") as outfile:
    outcsv = csv.writer(outfile)
    for i, row in store_data.iterrows():
        if not i % 10: #write 10% of the file to another file
            outcsv.writerow(row)
        elif i % 10:  #I need to do some process on the rest of the file
            store_data = store_data.applymap(str)
Patrick Artner
user91
  • So it does not matter which 10% of the rows you take? – amanb Apr 20 '19 at 09:16
  • put `print(i, i % 10)` in code to see what you get - it should help you understand why it doesn't work – furas Apr 20 '19 at 09:31
  • Possible duplicate of [How do I create test and train samples from one dataframe with pandas?](https://stackoverflow.com/questions/24147278/how-do-i-create-test-and-train-samples-from-one-dataframe-with-pandas) – recnac Apr 28 '19 at 01:39

2 Answers


It is far easier and cleaner to split your dataframe into two parts, save the 10% to a file with dataframe.to_csv(..), and apply your calculations to the remaining 90% in the second dataframe.

You do this by adding a new column that marks whether a row belongs to the test part, then dividing your dataframe in two by this new column's value:

Data file creation:

fn = "heart_disease.csv"
with open(fn,"w") as f:
    # doubled the data provided
    f.write("""Age,AL,SEX,DIAB,SMOK,CHOL,LAD,RCA,LM
65,0,M,n,y,220,80,75,20\n45,0.2,F,n,n,300,90,35,35\n66,-1,F,y,y,200,90,80,20
70,0.2,F,n,y,220,40,85,15\n80,1.1,M,y,y,200,90,90,25\n55,0,M,y,y,240,95,45,25
90,-1,M,n,y,350,35,75,20\n88,1,F,y,y,200,40,85,20\n50,1.1,M,n,n,220,55,30,30
95,-1,M,n,y,230,75,85,15\n30,1.1,F,n,y,235,75,20,30
65,0,M,n,y,220,80,75,20\n45,0.2,F,n,n,300,90,35,35\n66,-1,F,y,y,200,90,80,20
70,0.2,F,n,y,220,40,85,15\n80,1.1,M,y,y,200,90,90,25\n55,0,M,y,y,240,95,45,25
90,-1,M,n,y,350,35,75,20\n88,1,F,y,y,200,40,85,20\n50,1.1,M,n,n,220,55,30,30
95,-1,M,n,y,230,75,85,15\n30,1.1,F,n,y,235,75,20,30
""") 

Program:

import numpy as np
import pandas as pd

fn = "heart_disease.csv"
store_data = pd.read_csv(fn)
print(store_data)

percentage = 0.1
# one uniform random number in [0, 1) per row decides which part the row goes to
store_data["test"] = np.random.rand(len(store_data))

test_data = store_data[store_data.test <= percentage]   # roughly 10% of the rows
other_data = store_data[store_data.test > percentage]   # the remaining rows

print(test_data)
print(other_data)

Output:

# original data 
    Age   AL SEX DIAB SMOK  CHOL  LAD  RCA  LM
0    65  0.0   M    n    y   220   80   75  20
1    45  0.2   F    n    n   300   90   35  35
2    66 -1.0   F    y    y   200   90   80  20
3    70  0.2   F    n    y   220   40   85  15
4    80  1.1   M    y    y   200   90   90  25
5    55  0.0   M    y    y   240   95   45  25
6    90 -1.0   M    n    y   350   35   75  20
7    88  1.0   F    y    y   200   40   85  20
8    50  1.1   M    n    n   220   55   30  30
9    95 -1.0   M    n    y   230   75   85  15
10   30  1.1   F    n    y   235   75   20  30
11   65  0.0   M    n    y   220   80   75  20
12   45  0.2   F    n    n   300   90   35  35
13   66 -1.0   F    y    y   200   90   80  20
14   70  0.2   F    n    y   220   40   85  15
15   80  1.1   M    y    y   200   90   90  25
16   55  0.0   M    y    y   240   95   45  25
17   90 -1.0   M    n    y   350   35   75  20
18   88  1.0   F    y    y   200   40   85  20
19   50  1.1   M    n    n   220   55   30  30
20   95 -1.0   M    n    y   230   75   85  15
21   30  1.1   F    n    y   235   75   20  30

# data  with test <= 0.1
    Age   AL SEX DIAB SMOK  CHOL  LAD  RCA  LM      test
3    70  0.2   F    n    y   220   40   85  15  0.093135
10   30  1.1   F    n    y   235   75   20  30  0.021302

# data with test > 0.1
    Age   AL SEX DIAB SMOK  CHOL  LAD  RCA  LM      test
0    65  0.0   M    n    y   220   80   75  20  0.449546
1    45  0.2   F    n    n   300   90   35  35  0.953321
2    66 -1.0   F    y    y   200   90   80  20  0.928233
4    80  1.1   M    y    y   200   90   90  25  0.672880
5    55  0.0   M    y    y   240   95   45  25  0.136537
6    90 -1.0   M    n    y   350   35   75  20  0.439261
7    88  1.0   F    y    y   200   40   85  20  0.935340
8    50  1.1   M    n    n   220   55   30  30  0.737416
9    95 -1.0   M    n    y   230   75   85  15  0.461699
11   65  0.0   M    n    y   220   80   75  20  0.548624
12   45  0.2   F    n    n   300   90   35  35  0.679861
13   66 -1.0   F    y    y   200   90   80  20  0.195141
14   70  0.2   F    n    y   220   40   85  15  0.997854
15   80  1.1   M    y    y   200   90   90  25  0.871436
16   55  0.0   M    y    y   240   95   45  25  0.907141
17   90 -1.0   M    n    y   350   35   75  20  0.295690
18   88  1.0   F    y    y   200   40   85  20  0.970249
19   50  1.1   M    n    n   220   55   30  30  0.566218
20   95 -1.0   M    n    y   230   75   85  15  0.545188
21   30  1.1   F    n    y   235   75   20  30  0.217490 

The split is random: you might get exactly 10% of your data, or somewhat fewer or more rows. The bigger your data, the closer you will get to 10%.
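If you need exactly 10% of the rows rather than approximately 10%, one option is pandas' DataFrame.sample. This is only a minimal sketch, not the method used in the program above: the small in-memory dataframe is a hypothetical stand-in for heart_disease.csv, and random_state=0 is an arbitrary seed chosen to make the draw repeatable.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: 20 rows with a default RangeIndex.
store_data = pd.DataFrame({"Age": range(30, 50), "CHOL": range(200, 220)})

# Draw exactly 10% of the rows at random; random_state makes the draw repeatable.
test_data = store_data.sample(frac=0.1, random_state=0)

# Every row that was not drawn belongs to the remaining 90%.
other_data = store_data.drop(test_data.index)

print(len(test_data), len(other_data))
```

Dropping by test_data.index relies on the index labels being unique, which holds for the default index that read_csv creates.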

You can then write the two derived dataframes, test_data and other_data, to separate files using df.to_csv.
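Writing the two parts out could look roughly like this. It is a sketch under assumptions: the dataframe is a hypothetical stand-in for the CSV, the files go to a temporary directory, "rest.csv" is an invented file name ("out1.csv" is the one from the question), and the helper column "test" is dropped before saving so that it does not end up in the output.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV from the answer.
store_data = pd.DataFrame({"Age": range(30, 50), "CHOL": range(200, 220)})

# Same marker-column split as in the program above, seeded so it is repeatable.
rng = np.random.default_rng(0)
store_data["test"] = rng.random(len(store_data))
test_data = store_data[store_data.test <= 0.1]
other_data = store_data[store_data.test > 0.1]

# Write both parts, leaving out the helper column and the index.
out_dir = tempfile.mkdtemp()
test_path = os.path.join(out_dir, "out1.csv")
other_path = os.path.join(out_dir, "rest.csv")  # hypothetical file name
test_data.drop(columns="test").to_csv(test_path, index=False)
other_data.drop(columns="test").to_csv(other_path, index=False)
```

Passing index=False keeps the row labels out of the files, so reading them back gives the original columns only.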

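For a pure pandas solution, see "How do I create test and train samples from one dataframe with pandas?" (https://stackoverflow.com/questions/24147278/how-do-i-create-test-and-train-samples-from-one-dataframe-with-pandas), which is a duplicate of your question. However, you seem to be handling the CSV separately, so I am not sure it applies.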

Patrick Artner

Here's a pure Pandas solution:

import pandas as pd
df = pd.read_csv("heart_disease.csv")
# number of rows that make up the first 10% of the file
n = round(len(df) * 10 / 100)
# select the first 10% of the rows by position
df_slice = df.iloc[:n]
# write the sliced df to csv without the index column
df_slice.to_csv("sliced.csv", index=False)
# to work with the rest of the data, drop the rows that went into df_slice
df.drop(df_slice.index, inplace=True)  # 90% of data
# df now holds the remaining 90% and you can do whatever you want with it
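If you would rather keep the "every tenth row" selection from the question's loop instead of taking the first 10%, a positional boolean mask is one way to do it. This is a minimal sketch and not part of the answer's code: the dataframe is a hypothetical 22-row stand-in for heart_disease.csv, and astype(str) stands in for the string conversion attempted in the question, applied here only to the remaining rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for heart_disease.csv: 22 rows, like the sample data above.
df = pd.DataFrame({"Age": range(30, 52), "CHOL": range(200, 222)})

# True for positions 0, 10, 20, ...: the rows the question's loop writes out.
every_tenth = np.arange(len(df)) % 10 == 0

tenth = df[every_tenth]               # roughly 10% of the rows
rest = df[~every_tenth].astype(str)   # the remaining rows, converted to strings

print(len(tenth), len(rest))
```

Because the mask is built from row positions rather than index labels, it also works when the index is not a plain 0..n-1 range.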
amanb