
How can I iterate over rows in a DataFrame until the sample ID changes?

my_df:

ID        loc_start
sample1   10
sample1   15
sample2   10
sample2   20
sample3   5

Something like:

samples = ["sample1", "sample2", "sample3"]

out = pd.DataFrame()
for sample in samples:
    my_list = []
    for index, row in my_df[my_df["ID"] == sample].iterrows():
        my_list.append(row.loc_start)
    out = pd.concat([out, pd.DataFrame({sample: my_list})], axis=1)

Expected output:

sample1   sample2   sample3
     10        10         5    
     15        20

I realize, of course, that this could be done more easily if my_df really looked like this. However, what I'm after is the principle of iterating over rows until a certain column value changes.
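For the general principle of processing rows until the ID changes, `groupby` yields one sub-frame per ID, which replaces the manual "loop until the value changes" approach. A minimal sketch, with `my_df` reconstructed from the table above:

```python
import pandas as pd

# my_df reconstructed from the question's table
my_df = pd.DataFrame({
    "ID": ["sample1", "sample1", "sample2", "sample2", "sample3"],
    "loc_start": [10, 15, 10, 20, 5],
})

# groupby hands back one sub-frame per distinct ID;
# sort=False keeps the groups in their original order
for sample, group in my_df.groupby("ID", sort=False):
    print(sample, group["loc_start"].tolist())
```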

lindak
    @ScottBoston Please see edit – lindak Oct 10 '19 at 13:28
@lindak It seems more like a `pivot` operation is what you're trying to do. Have a look at this [answer](https://stackoverflow.com/a/22799916/9274732); they use two columns (product and price) and you just have one (ID), but otherwise it should give you an idea of how to do it :) – Ben.T Oct 10 '19 at 13:41

2 Answers


Based on the input & output provided, this should work. You need to provide more info if you are looking for something else.

df.pivot(columns='ID', values='loc_start').rename_axis(None, axis=1).apply(lambda x: pd.Series(x.dropna().values))

output

   sample1  sample2  sample3
0     10.0     10.0      5.0
1     15.0     20.0      NaN
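A self-contained version of the above, with `my_df` rebuilt from the question's table so the one-liner can be run as-is:

```python
import pandas as pd

# my_df as shown in the question
my_df = pd.DataFrame({
    "ID": ["sample1", "sample1", "sample2", "sample2", "sample3"],
    "loc_start": [10, 15, 10, 20, 5],
})

# pivot spreads loc_start across one column per ID, then each
# column is compacted by dropping the NaN gaps left by the pivot
out = (my_df.pivot(columns="ID", values="loc_start")
            .rename_axis(None, axis=1)
            .apply(lambda x: pd.Series(x.dropna().values)))
print(out)
```

Note the values come out as floats: the pivot introduces NaN, which forces the integer column to float64.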
moys

Ben.T is correct that a pivot works here. Here is an example:

import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.randint(0, 5, (10, 2)), columns=list("AB"))

# what does the df look like?  Here, I consider column A to be analogous to your "ID" column

In [5]: df
Out[5]: 
   A  B
0  3  1
1  2  1
2  4  2
3  4  1
4  0  4
5  4  2
6  4  1
7  3  1
8  1  1
9  4  0


# now do a pivot and see what it looks like
df2 = df.pivot(columns="A", values="B")

In [8]: df2
Out[8]: 
A    0    1    2    3    4
0  NaN  NaN  NaN  1.0  NaN
1  NaN  NaN  1.0  NaN  NaN
2  NaN  NaN  NaN  NaN  2.0
3  NaN  NaN  NaN  NaN  1.0
4  4.0  NaN  NaN  NaN  NaN
5  NaN  NaN  NaN  NaN  2.0
6  NaN  NaN  NaN  NaN  1.0
7  NaN  NaN  NaN  1.0  NaN
8  NaN  1.0  NaN  NaN  NaN
9  NaN  NaN  NaN  NaN  0.0

Not quite what you wanted. With a little help from jezrael's answer:

df3 = df2.apply(lambda x: pd.Series(x.dropna().values))

In [20]: df3
Out[20]: 
A    0    1    2    3    4
0  4.0  1.0  1.0  1.0  2.0
1  NaN  NaN  NaN  1.0  1.0
2  NaN  NaN  NaN  NaN  2.0
3  NaN  NaN  NaN  NaN  1.0
4  NaN  NaN  NaN  NaN  0.0

The empty spots in the dataframe have to be filled with something, and NaN is used by default. Is this what you wanted?

If, on the other hand, you wanted to perform an operation on your data, you would use `groupby` instead:

df2 = df.groupby(by="A", as_index=False).mean()
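To make the `groupby` variant reproducible, here is a small deterministic example (the data values are illustrative, not from the random frame above):

```python
import pandas as pd

# small fixed frame; A plays the role of the "ID" column
df = pd.DataFrame({"A": [3, 2, 4, 4, 0],
                   "B": [1, 1, 2, 1, 4]})

# one aggregated row per unique value of A;
# as_index=False keeps A as a regular column
means = df.groupby(by="A", as_index=False).mean()
print(means)
```

For A = 4, the two B values 2 and 1 are averaged to 1.5, while each other A value keeps its single B value.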

tnknepp