2

I need to select N rows at a time from a pandas DataFrame using iterrows, getting the next N rows on each call. Something like this:

def func():
    selected = []
    for i in range(N):
        selected.append(next(dataframe.iterrows()))

    yield selected

But with this code, selected contains N identical elements, and every call to func returns the same result (the first row of the dataframe).

If the dataframe is:

   A  B  C
0  5  8  2
1  1  2  3
2  4  5  6
3  7  8  9
4  0  1  2
5  3  4  5
6  7  8  6
7  1  2  3

What I want to obtain is:

N = 3
selected = [ [5,8,2], [1,2,3], [4,5,6] ] 
then, calling the function again,
selected = [ [7,8,9], [0,1,2], [3,4,5] ] 
then,
selected = [ [7,8,6], [1,2,3], [5,8,2] ] 
solopiu
  • Why are you trying to do this, i.e. what is your end goal? If your dataframe fits in memory, you should probably be using `groupby` instead of trying to process it in chunks. You can `groupby(np.arange(dataframe.shape[0]) // N)` to simulate the chunks. If you are reading the dataframe from disk, then read it in chunks from the source instead; `read_csv` etc. have this functionality already. – Dan Jul 25 '19 at 09:16
  • My end goal is something like the flow_from_dataframe of a keras generator, but I cannot use that. So I need to select N (like batch_size) elements and then postprocess them with a Keras model. I think I cannot do this without a generator, slicing the elements inside the dataframe – solopiu Jul 25 '19 at 09:25
  • I don't see the point of a generator for that. Generators are useful when you don't store all the data in memory at once, but in your case you already have it in memory, so just use a normal function with return. Unless you are passing this to a keras function that only accepts a generator? Either way, `.iloc` is the correct way to implement it. – Dan Jul 25 '19 at 09:36
  • Yes, sorry, I did not explain myself well. I use it to create my custom generator and then pass this custom generator to the fit_generator function. I think I've solved the problem; anyway, I'll think more about what you said, thank you so much. – solopiu Jul 25 '19 at 10:02
  • have a look at my updated answer: https://stackoverflow.com/a/57198258/1011724 this is a better way to make the generator (tested and working this time!) using `iloc`. This way you (a) avoid [iterating in pandas which is terribly inefficient](https://stackoverflow.com/a/55557758/1011724) (b) avoid making a growing list which is slightly inefficient and (c) stick within pandas idioms making your code more readable to other pandas devs – Dan Jul 25 '19 at 12:36
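
The `groupby` idea from the comments above can be sketched as follows. This is only an illustration, assuming the example dataframe from the question and `N = 3`; the names `group_ids` and `chunks` are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Example dataframe from the question
df = pd.DataFrame({
    "A": [5, 1, 4, 7, 0, 3, 7, 1],
    "B": [8, 2, 5, 8, 1, 4, 8, 2],
    "C": [2, 3, 6, 9, 2, 5, 6, 3],
})
N = 3

# Integer-divide the row positions by N so that every run of N rows
# shares one group id: 0, 0, 0, 1, 1, 1, 2, 2
group_ids = np.arange(len(df)) // N

# Each group is a sub-dataframe of at most N rows
chunks = [group.values.tolist() for _, group in df.groupby(group_ids)]
```

Note that the last group is shorter when the number of rows is not a multiple of N, and nothing wraps around to the first row.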

5 Answers

5

There is no need for .iterrows(); use slicing instead:

import pandas as pd

def flow_from_df(dataframe: pd.DataFrame, chunk_size: int = 10):
    # Yield consecutive chunks of up to chunk_size rows; the last chunk may be shorter
    for start_row in range(0, dataframe.shape[0], chunk_size):
        end_row = min(start_row + chunk_size, dataframe.shape[0])
        yield dataframe.iloc[start_row:end_row, :]

To use it:

get_chunk = flow_from_df(dataframe)
chunk1 = next(get_chunk)
chunk2 = next(get_chunk)
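
As a rough check, here is the same generator run end to end on the question's example data with `chunk_size=3`. The dataframe construction and the `.values.tolist()` conversion to lists of lists are my additions for illustration, not part of the original snippet.

```python
import pandas as pd

def flow_from_df(dataframe: pd.DataFrame, chunk_size: int = 10):
    # Yield consecutive positional slices of up to chunk_size rows
    for start_row in range(0, dataframe.shape[0], chunk_size):
        end_row = min(start_row + chunk_size, dataframe.shape[0])
        yield dataframe.iloc[start_row:end_row, :]

# Example dataframe from the question
df = pd.DataFrame({
    "A": [5, 1, 4, 7, 0, 3, 7, 1],
    "B": [8, 2, 5, 8, 1, 4, 8, 2],
    "C": [2, 3, 6, 9, 2, 5, 6, 3],
})

# Convert every chunk to a plain list of lists, as in the question
batches = [chunk.values.tolist() for chunk in flow_from_df(df, chunk_size=3)]
```

The generator is exhausted after the last (shorter) chunk; it does not wrap around to the first row the way the question's third batch does.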

Or not using a generator:

def get_chunk(dataframe: pd.DataFrame, chunk_size: int, start_row: int = 0) -> pd.DataFrame:
    end_row  = min(start_row + chunk_size, dataframe.shape[0])

    return dataframe.iloc[start_row:end_row, :]
Dan
1

Use return instead of yield. If you want the plain data in selected as a list of lists, you can do this:

def func():
    selected = []
    # Assumes df has the default integer index 0, 1, 2, ...
    for index, row in df.iterrows():
        if index >= N:
            break
        # Collect the row's values as a plain list
        selected.append([row['A'], row['B'], row['C']])

    return selected
SM Abu Taher Asif
  • It doesn't work. I get the same result for `selected = func()` with return and `selected = next(func())` with yield – solopiu Jul 25 '19 at 09:12
1

I am assuming you are calling the function in a loop. You can try this.

def select_in_df(start, end):
    # Positional slice [start, end), converted to a plain list of lists
    selected = data_frame[start:end]
    return selected.values.tolist()


print(select_in_df(0, 4))  # rows 0 to 3; update start and end in whatever loop is convenient

# here is an example
N = 3
for start in range(0, len(data_frame), N):  # step through the frame N rows at a time
    print(select_in_df(start, start + N))  # e.g. 0:3 gives you 3 rows
Gravity Mass
1

I think I found an answer by doing this:

def func(rowws = df.iterrows(), N=3):
    selected = []
    for i in range(N):
        selected.append(next(rowws))

    yield selected

selected = next(func())
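
A note on why this behaves differently from the code in the question: a default argument is evaluated once, when the function is defined, so every call shares the same iterator, whereas calling `dataframe.iterrows()` inside the loop builds a fresh iterator each time and `next` always returns the first row. Below is a minimal sketch of the same shared-iterator idea using only the standard library, with wrap-around added via `itertools.cycle`. The name `batches` and the plain list `rows` (standing in for `dataframe.values.tolist()`) are assumptions for illustration.

```python
from itertools import cycle, islice

def batches(rows, n):
    """Yield lists of n rows forever, wrapping around at the end."""
    endless = cycle(rows)  # one shared iterator, created once
    while True:
        # Take the next n rows from the shared iterator
        yield list(islice(endless, n))

# Stand-in for dataframe.values.tolist() on the question's data
rows = [
    [5, 8, 2], [1, 2, 3], [4, 5, 6], [7, 8, 9],
    [0, 1, 2], [3, 4, 5], [7, 8, 6], [1, 2, 3],
]

gen = batches(rows, 3)
first = next(gen)
second = next(gen)
third = next(gen)  # wraps around to the first row
```

Since the generator never ends, it should only be consumed with `next` or with a bounded loop.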
solopiu
  • I really recommend you don't do this. `iterrows` is extremely inefficient. Try the generator in my solution, I just edited it and tested it. It works. – Dan Jul 25 '19 at 09:46
0

Try using:

def func(dataframe, N=3):
    return np.array_split(dataframe.values, N)

print(func(dataframe))

Output:

[array([[5, 8, 2],
       [1, 2, 3],
       [4, 5, 6]]), array([[7, 8, 9],
       [0, 1, 2],
       [3, 4, 5]]), array([[7, 8, 6],
       [1, 2, 3]])]
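
One caveat: an integer second argument to `np.array_split` is the number of sections, not the number of rows per section. With 8 rows and N=3 the two happen to give the same sizes (3, 3, 2). A sketch of splitting by chunk size instead, by passing split indices; the helper name `split_by_size` is made up for this example.

```python
import numpy as np

def split_by_size(values, n):
    # Split before rows n, 2n, 3n, ... so that each chunk has n rows
    # (the last chunk may be shorter)
    return np.array_split(values, np.arange(n, len(values), n))

# Values of the question's dataframe
values = np.array([
    [5, 8, 2], [1, 2, 3], [4, 5, 6], [7, 8, 9],
    [0, 1, 2], [3, 4, 5], [7, 8, 6], [1, 2, 3],
])

by_sections = np.array_split(values, 2)  # 2 sections of 4 rows each
by_size = split_by_size(values, 2)       # 4 chunks of 2 rows each
```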
U13-Forward