Simplifying/collapsing blocks of df.iterrows() where another row is selected based on a value in the current row

Question

I need help :) I've got a python script in which a snippet that repeats df.iterrows() a couple of times takes a good half hour to run. I've seen a lot of examples using lambda functions but this is where apparently I meet my limitations. It's too complex for me to figure out I guess.

Current code (simplified snippet):

for index, row in df.iterrows():
    if row["series"] == -1 and row["sort"] == 1 and row["amount_t-2"] != 0:
        index_match = df[
            (df["sort"] == 4)
            & (df["year"] == (row["year"] - 1))
            & (df["amount_t-1"] == (row["amount_t-2"]))
            & (df["cola"] == (row["cola"]))
            & (df["colb"] == (row["colb"]))
        ].index.tolist()
        if len(index_match) == 1:
            df.loc[index, "series"] = df.loc[index_match[0], "series"]
            df.loc[index, "colc"] = "done"
            df.loc[index, "cold"] = ""

This snippet is repeated a number of times for different configurations/values. What's done here is looping through the entire dataset, creating a time series by setting the 'series' column to the same (index) number if certain conditions are met (i.e. if the values are linked)

Little background: the dataset consists of budgetary tables in which each row represents the state of a budget article for a certain budget phase/time (sort field: 1=budget, 2=first supplement, 3=second supplement, 4=actuals). In order to be able to trace a certain budget article through time, I want to link all relevant rows through a common value for the field 'series'.

When searching for a row to link to the current row, I filter the df by: 'previous phase' (sort-1 or in this case 4 if sort=1), 'year of previous phase' (same year for sort=2,3,4, previous if sort=1), matching value/amounts and matching article name etc.

So far, I've been meddling with something like the following code, but cannot get it to work. Am I at least close?

df['series'] = df.apply(lambda x: df.query("sort == 4 and year == @x['year']-1 and `amount_t-1` == @x['amount_t-2'] and cola == @x['cola'] and colb == @x['colb']").index[0] if x['series'] == -1 and x['sort'] == 1 and x['amount_t-2'] != 0, axis=1)

Thanks in advance,

Chris

Edit: Sample data

Before script starts, database contains:

year sort cola series
2020   1  I.1   1
2020   2  I.1   1
2020   3  I.1   1
2020   4  I.1   1
2021   1  I.2   1 (cola, being article name, is renamed to I.2 for this year)
2021   2  I.2   1
2021   3  I.2   1
2022   1  I.2   1 (note that 2021.4 is not present yet but 2022.1 is)

During script execution (before this snippet!) some rows are added:

2021   4  I.1  -1 (each row is added with a series id of -1 as default)
2021   4  I.2  -1 (this is the row we're looking for in series 1)
2021   4  I.3  -1 (all other rows I.1, I.3, I.* and II.* thru XX.* also get a series number, just not number 1...)

So, superficially (and for many cases also in reality), it seems like the series number corresponds with col.a (article name/number). However, article name/numbers may change over time or articles may be split or merged leading to a break in the series. Thus, I cannot simply use the article number as a guide.

Instead, I compare the amount_t in the current row to the amount_t-1 in the candidate row for previous in the series as a double check.

In the example above, the script should yield:

year sort cola series
2020   1  I.1   1
2020   2  I.1   1
2020   3  I.1   1
2020   4  I.1   1
2021   1  I.2   1
2021   2  I.2   1
2021   3  I.2   1
2021   4  I.1   2
2021   4  I.2   1 <-- this one right here :)
2021   4  I.3   3
2022   1  I.2   1

Hope this clarifies.

Please post sample data and desired output. Human language is not as precise as numeric representation. See [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/1422451). — Parfait, Feb 12 '22 at 16:47
Honestly, I am still confused. Is *series* a new calculated field or original? I don't know what *this one right here* means. Added or changed row? What about other 2 added ones? Why `1.3`? Your background paragraph is very helpful in avoiding the [XY Problem](https://meta.stackexchange.com/q/66377). Try to avoid explaining your Y solution but help us with the fuller X problem. You seem to just need to fill in gaps of time series. Why not `merge` to a set of all possible combinations of year / sort / cola? — Parfait, Feb 13 '22 at 19:00

score 0 · Answer 1 · answered Feb 12 '22 at 16:38

I would suggest you refactor your code in a bit more idiomatic way and use pandas .at method instead of .loc, since it is much faster, like this:

# Get the first condition out of the 'for loop'
sub_df = df.loc[(df["series"] == -1) & (df["sort"] == 1) & (df["amount_t-2"] != 0), :]

# Iterate by asking for forgiveness, not permission
for i, row in sub_df.iterrows():
    try:
        index_match = df[
            (df["sort"] == 4)
            & (df["year"] == (row["year"] - 1))
            & (df["amount_t-1"] == (row["amount_t-2"]))
            & (df["cola"] == (row["cola"]))
            & (df_int["colb"] == (row["colb"]))
        ].index[0]
    except IndexError:
        continue
    else:
        df.at[i, "series"] = df.at[index_match, "series"]
        df.at[i, "colc"] = "done"
        df.at[i, "cold"] = ""

Thanks Laurent, will definitely try this. Ideally, I would like to avoid the iterrows() alltogether, but if I must, I must :) — Christian C. Schouten, Feb 12 '22 at 21:10

Simplifying/collapsing blocks of df.iterrows() where another row is selected based on a value in the current row

1 Answers1