
I have a data frame with 13000 rows and 3 columns:

('time', 'rowScore', 'label')

I want to read it subset by subset:

[[1..360], [360..712], ..., [12640..13000]]

I also tried using a list, but it's not working:

import pandas as pd
import math
import datetime

result = "data.csv"
dataSet = pd.read_csv(result)
TP = 0
count = 0
x = 0
df = pd.DataFrame(dataSet, columns=['rawScore', 'label'])

for i, row in df.iterrows():
    data = row.to_dict()

    ScoreX = data['rawScore']
    labelX = data['label']

for i in range(1, 13000, 360):
    x = x + 1
    for j in range(i, 360 * x, 1):
        if (ScoreX > 0.3) and (labelX == 0):
            count = count + 1
print("count=", count)
  • Look into using the chunksize parameter in your read_csv call. Some detail on its use is here: http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk – Gavin Dec 17 '18 at 23:02
  • Thanks for your reply, but my problem when I use a list or pandas is how I can get the exact ScoreX and labelX. – wissal Dec 18 '18 at 00:15
  • The list is like this: [['time','scor','label'], ['time','scor','label'], ['time','scor','label'], ..., ['time','scor','label']] – wissal Dec 18 '18 at 00:17
  • Sorry - I focused on your question title (i.e. how to read a CSV subset by subset). Maybe show the result of dataset.head() (and df.head()), and consider editing your question title. You might also want to look at the answer here: https://stackoverflow.com/questions/25699439/how-to-iterate-over-consecutive-chunks-of-pandas-dataframe-efficiently as it seems you want to operate on subsets of the dataframe itself, and it should be possible to be much more efficient than iterating over the dataframe row by row. – Gavin Dec 18 '18 at 00:54

1 Answer


You can also use the chunksize parameter of read_csv, or nrows together with skiprows, to break the file up into chunks. I would recommend against using iterrows, since it is typically very slow. If you split the data into chunks while reading it in and process each chunk separately, you can skip the iterrows section entirely. This covers the file-reading side, if you want to split the data into chunks (which seems to be an intermediate step in what you're trying to do).
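As a minimal sketch (assuming the file is data.csv with the rawScore and label columns from the question), chunksize makes read_csv yield DataFrames of up to 360 rows each, so every subset can be processed on its own:

import pandas as pd

count = 0
# Each `chunk` is a DataFrame of at most 360 rows read from the file.
for chunk in pd.read_csv("data.csv", chunksize=360):
    # Vectorised check instead of iterrows: count rows in this chunk
    # where rawScore > 0.3 and label == 0.
    count += ((chunk["rawScore"] > 0.3) & (chunk["label"] == 0)).sum()
print("count =", count)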

Another way is to subset with a generator, checking whether the row indices fall into each range: [[1..360], [360..712], ..., [12640..13000]]

So write a function that steps through the index at multiples of 360 and, when the indices fall in a given range, selects that particular subset (see the sketch below).
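Here is a rough sketch of that idea; the iter_chunks helper and the 360-row size are assumptions based on the ranges in the question, and it slices the DataFrame by position rather than testing each index individually:

import pandas as pd

def iter_chunks(df, size=360):
    # Yield consecutive row subsets of df, each at most `size` rows long,
    # i.e. positions [0:360], [360:720], ..., up to the end of the frame.
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size]

df = pd.read_csv("data.csv")
for subset in iter_chunks(df, 360):
    # Per-subset count of rows with rawScore > 0.3 and label == 0.
    count = ((subset["rawScore"] > 0.3) & (subset["label"] == 0)).sum()
    print(subset.index[0], subset.index[-1], count)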

I just wrote these approaches down as alternative ideas you might want to play around with, since in some cases you may only want a subset and not all of the chunks for calculation purposes.

qxzsilver