Select next N rows in pandas dataframe using iterrows

Question

I need to select N rows at a time from a pandas DataFrame using iterrows. Something like this:

def func():
    selected = []
    for i in range(N):
        selected.append(next(dataframe.iterrows()))

    yield selected

But with this code, selected contains N identical elements, and every time I call func I get the same result (the first row of the dataframe).

If the dataframe is:

What I want to obtain is:

N = 3
selected = [ [5,8,2], [1,2,3], [4,5,6] ] 
then, calling the function again,
selected = [ [7,8,9], [0,1,2], [3,4,5] ] 
then,
selected = [ [7,8,6], [1,2,3], [5,8,2] ]

Why are you trying to do this, i.e. what is your end goal? If your dataframe fits in memory, you should probably be using `groupby` instead of trying to process it in chunks. You can `groupby(np.arange(dataframe.shape[0]) // N)` to simulate the chunks. If you are reading the dataframe from disk, then read it in chunks from the source instead. `read_csv` and similar readers already have this functionality. — Dan, Jul 25 '19 at 09:16
My end goal is something like the flow_from_dataframe of a keras generator, but I cannot use that. So I need to select N (like batch_size) elements and then postprocess them with a Keras model. I think I cannot do this without a generator, slicing the elements inside the dataframe — solopiu, Jul 25 '19 at 09:25
I don't see the point of a generator for that. Generators are useful when you don't store all the data in memory at once, but in your case you already have it in memory, so just use a normal function with return. Unless you are passing this to a keras function that only accepts a generator? Either way, `.iloc` is the correct way to implement it. — Dan, Jul 25 '19 at 09:36
Yes, sorry, I didn't explain myself well. I use it to create my custom generator and then pass this custom generator to the fit_generator function. I think I've solved the problem; anyway, I'll think more about what you said, thank you so much. — solopiu, Jul 25 '19 at 10:02
have a look at my updated answer: https://stackoverflow.com/a/57198258/1011724 this is a better way to make the generator (tested and working this time!) using `iloc`. This way you (a) avoid [iterating in pandas which is terribly inefficient](https://stackoverflow.com/a/55557758/1011724) (b) avoid making a growing list which is slightly inefficient and (c) stick within pandas idioms making your code more readable to other pandas devs — Dan, Jul 25 '19 at 12:36
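The two alternatives mentioned in the comments above (grouping an in-memory frame into blocks, and reading in chunks from the source) can be sketched as follows. This is only an illustration under assumptions: the frame contents, the value of N, and the in-memory CSV standing in for a file on disk are all made up.

```python
import io

import numpy as np
import pandas as pd

N = 3  # chunk size (hypothetical value)

# Made-up 8-row frame standing in for the real data.
df = pd.DataFrame(np.arange(24).reshape(8, 3), columns=["A", "B", "C"])

# In memory: rows 0..N-1 get group key 0, rows N..2N-1 get key 1, and so on.
groupby_chunks = [chunk for _, chunk in df.groupby(np.arange(len(df)) // N)]

# From a source: with chunksize set, read_csv returns an iterator of DataFrames.
# io.StringIO stands in for a real file path here.
csv_source = io.StringIO(df.to_csv(index=False))
csv_chunks = list(pd.read_csv(csv_source, chunksize=N))

print([len(chunk) for chunk in groupby_chunks])
print([len(chunk) for chunk in csv_chunks])
```

In both cases the last chunk is shorter when the number of rows is not a multiple of N.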

Dan · Accepted Answer · 2019-07-25T09:47:46.767

There is no need for .iterrows(); use slicing instead:

import pandas as pd

def flow_from_df(dataframe: pd.DataFrame, chunk_size: int = 10):
    # Yield consecutive slices of up to chunk_size rows
    for start_row in range(0, dataframe.shape[0], chunk_size):
        end_row = min(start_row + chunk_size, dataframe.shape[0])
        yield dataframe.iloc[start_row:end_row, :]

To use it:

get_chunk = flow_from_df(dataframe)
chunk1 = next(get_chunk)
chunk2 = next(get_chunk)
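As a quick check against the output requested in the question, here is a self-contained run of the generator. The nine rows are taken from the question's expected output; the column names A, B, C are an assumption borrowed from another answer, and `.values.tolist()` is just one way to get plain lists.

```python
import pandas as pd


def flow_from_df(dataframe: pd.DataFrame, chunk_size: int = 10):
    # Same generator as in the answer: yield consecutive row slices.
    for start_row in range(0, dataframe.shape[0], chunk_size):
        end_row = min(start_row + chunk_size, dataframe.shape[0])
        yield dataframe.iloc[start_row:end_row, :]


# Rows copied from the question's expected output; column names are assumed.
rows = [[5, 8, 2], [1, 2, 3], [4, 5, 6],
        [7, 8, 9], [0, 1, 2], [3, 4, 5],
        [7, 8, 6], [1, 2, 3], [5, 8, 2]]
df = pd.DataFrame(rows, columns=["A", "B", "C"])

# Iterating with a for loop (or a comprehension) stops cleanly at the end,
# whereas an extra next() call would raise StopIteration.
batches = [chunk.values.tolist() for chunk in flow_from_df(df, chunk_size=3)]
print(batches[0])
```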

Or not using a generator:

def get_chunk(dataframe: pd.DataFrame, chunk_size: int, start_row: int = 0) -> pd.DataFrame:
    end_row  = min(start_row + chunk_size, dataframe.shape[0])

    return dataframe.iloc[start_row:end_row, :]
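The comments on the question mention passing the generator to Keras' fit_generator. As far as I know, that function expects a generator that keeps yielding batches indefinitely rather than one that stops after a single pass. A minimal sketch of that variation follows; `endless_batches` is a hypothetical name, the frame is made up, and no Keras call is shown.

```python
import itertools

import pandas as pd


def endless_batches(dataframe: pd.DataFrame, batch_size: int):
    """Yield consecutive batches, starting over after the last row."""
    while True:
        for start_row in range(0, len(dataframe), batch_size):
            # iloc slices clip at the end of the frame, so the final
            # batch of each pass may be shorter than batch_size.
            yield dataframe.iloc[start_row:start_row + batch_size]


df = pd.DataFrame({"A": range(5)})  # made-up data

# islice takes a fixed number of batches from the endless generator.
first_four = list(itertools.islice(endless_batches(df, 2), 4))
print([batch["A"].tolist() for batch in first_four])
```

With 5 rows and a batch size of 2, the fourth batch wraps around to the start of the frame.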

SM Abu Taher Asif · Answer 2 · 2019-07-25T09:39:52.130

return should be used instead of yield. If you want plain data in selected as a list of lists, you can do this:

def func():
    selected = []
    for index, row in df.iterrows():
        # Assumes the default integer index 0, 1, 2, ...
        if index < N:
            selected.append([row['A'], row['B'], row['C']])
        else:
            break

    return selected

edited Jul 25 '19 at 09:39

answered Jul 25 '19 at 09:06

It doesn't work. I get the same result for `selected = func()` with return and `selected = next(func())` with yield – solopiu Jul 25 '19 at 09:12

Gravity Mass · Answer 3 · 2019-07-25T09:32:34.177

I am assuming you are calling the function in a loop. You can try this.

def select_in_df(start, end):
    # Slice rows start..end-1 and convert them to a plain list of lists
    selected = data_frame[start:end]
    return selected.values.tolist()


print(select_in_df(0, 3))  # rows 0, 1 and 2

# Here is an example: step through the dataframe N rows at a time
N = 3
for start in range(0, len(data_frame), N):
    print(select_in_df(start, start + N))

score 1 · Answer 4 · answered Jul 25 '19 at 09:43

I think I found an answer by doing this:

def func(rows=df.iterrows(), N=3):
    # The default argument is evaluated once, so every call shares one iterator
    selected = []
    for i in range(N):
        selected.append(next(rows))

    yield selected

selected = next(func())
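The reason the code in the question always returned the first row is that each call to `dataframe.iterrows()` creates a fresh iterator, so `next()` on it starts from the top every time. The version above works because the iterator is created only once. A sketch of the same idea with the iterator made explicit and `itertools.islice` doing the counting is below; `next_rows` is a hypothetical helper and the data is made up. It is meant to illustrate the iterator mechanics, not to recommend `iterrows` for performance.

```python
from itertools import islice

import pandas as pd

df = pd.DataFrame([[5, 8, 2], [1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=["A", "B", "C"])  # made-up data

row_iter = df.iterrows()  # create the iterator once and keep it around


def next_rows(rows, n):
    """Take up to n rows from an existing iterator of (index, Series) pairs."""
    return [row.tolist() for _, row in islice(rows, n)]


first = next_rows(row_iter, 3)   # consumes rows 0..2
second = next_rows(row_iter, 3)  # continues where the first call stopped
print(first)
print(second)
```

Unlike calling `next()` in a loop, `islice` simply returns fewer rows when the iterator runs out.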

solopiu

I really recommend you don't do this. `iterrows` is extremely inefficient. Try the generator in my solution, I just edited it and tested it. It works. – Dan Jul 25 '19 at 09:46

score 0 · Answer 5 · answered Jul 25 '19 at 09:10

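Try using:

import numpy as np

def func(dataframe, N=3):
    # Split before rows N, 2N, ... so each chunk has N rows (the last may be shorter)
    return np.array_split(dataframe.values, range(N, len(dataframe), N))

print(func(dataframe))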

Output:

[array([[5, 8, 2],
       [1, 2, 3],
       [4, 5, 6]]), array([[7, 8, 9],
       [0, 1, 2],
       [3, 4, 5]]), array([[7, 8, 6],
       [1, 2, 3]])]
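One thing to watch with `np.array_split`: an integer second argument is the number of sections, not the number of rows per section, while a list of indices gives the split points. With 8 rows and N = 3 both readings happen to give pieces of 3, 3 and 2 rows, but they differ in general. A small sketch with 10 made-up rows:

```python
import numpy as np

values = np.arange(30).reshape(10, 3)  # 10 made-up rows
N = 3

# Integer argument: split into N sections of nearly equal size.
by_sections = np.array_split(values, N)

# Index list: split before rows N, 2N, ... to get chunks of N rows each.
by_size = np.array_split(values, list(range(N, len(values), N)))

print([len(part) for part in by_sections])
print([len(part) for part in by_size])
```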
