I need help :) I've got a python script in which a snippet that repeats df.iterrows() a couple of times takes a good half hour to run. I've seen a lot of examples using lambda functions but this is where apparently I meet my limitations. It's too complex for me to figure out I guess.
Current code (simplified snippet):
for index, row in df.iterrows():
if row["series"] == -1 and row["sort"] == 1 and row["amount_t-2"] != 0:
index_match = df[
(df["sort"] == 4)
& (df["year"] == (row["year"] - 1))
& (df["amount_t-1"] == (row["amount_t-2"]))
& (df["cola"] == (row["cola"]))
& (df["colb"] == (row["colb"]))
].index.tolist()
if len(index_match) == 1:
df.loc[index, "series"] = df.loc[index_match[0], "series"]
df.loc[index, "colc"] = "done"
df.loc[index, "cold"] = ""
This snippet is repeated a number of times for different configurations/values. What's done here is looping through the entire dataset, creating a time series by setting the 'series' column to the same (index) number if certain conditions are met (i.e. if the values are linked)
Little background: the dataset consists of budgetary tables in which each row represents the state of a budget article for a certain budget phase/time (sort field: 1=budget, 2=first supplement, 3=second supplement, 4=actuals). In order to be able to trace a certain budget article through time, I want to link all relevant rows through a common value for the field 'series'.
When searching for a row to link to the current row, I filter the df by: 'previous phase' (sort-1 or in this case 4 if sort=1), 'year of previous phase' (same year for sort=2,3,4, previous if sort=1), matching value/amounts and matching article name etc.
So far, I've been meddling with something like the following code, but cannot get it to work. Am I at least close?
df['series'] = df.apply(lambda x: df.query("sort == 4 and year == @x['year']-1 and `amount_t-1` == @x['amount_t-2'] and cola == @x['cola'] and colb == @x['colb']").index[0] if x['series'] == -1 and x['sort'] == 1 and x['amount_t-2'] != 0, axis=1)
Thanks in advance,
Chris
Edit: Sample data
Before script starts, database contains:
year sort cola series
2020 1 I.1 1
2020 2 I.1 1
2020 3 I.1 1
2020 4 I.1 1
2021 1 I.2 1 (cola, being article name, is renamed to I.2 for this year)
2021 2 I.2 1
2021 3 I.2 1
2022 1 I.2 1 (note that 2021.4 is not present yet but 2022.1 is)
During script execution (before this snippet!) some rows are added:
2021 4 I.1 -1 (each row is added with a series id of -1 as default)
2021 4 I.2 -1 (this is the row we're looking for in series 1)
2021 4 I.3 -1 (all other rows I.1, I.3, I.* and II.* thru XX.* also get a series number, just not number 1...)
So, superficially (and for many cases also in reality), it seems like the series number corresponds with col.a (article name/number). However, article name/numbers may change over time or articles may be split or merged leading to a break in the series. Thus, I cannot simply use the article number as a guide.
Instead, I compare the amount_t in the current row to the amount_t-1 in the candidate row for previous in the series as a double check.
In the example above, the script should yield:
year sort cola series
2020 1 I.1 1
2020 2 I.1 1
2020 3 I.1 1
2020 4 I.1 1
2021 1 I.2 1
2021 2 I.2 1
2021 3 I.2 1
2021 4 I.1 2
2021 4 I.2 1 <-- this one right here :)
2021 4 I.3 3
2022 1 I.2 1
Hope this clarifies.