Python/increase code efficiency about multiple columns filter

Question

I was wondering if someone could help me find a more efficiency way to run my code.

I have a dataset contains 7 columns, which are country,sector,year,month,week,weekday,value.

the year column have only 3 elements, 2019,2020,2021

What I have to do here is to substract every value in 2020 and 2021 from 2019. But its more complicated that I need to match the weekday columns.

For example,i need to use year 2020, month 1, week 1, weekday 0(monday) value to substract, year 2019, month 1, week 1, weekday 0(monday) value, if cant find it, it will pass, and so on, which means, the weekday(monday,Tuesaday....must be matched)

And here is my code, it can run, but it tooks me hours:(

for i in itertools.product(year_list,country_list, sector_list,month_list,week_list,weekday_list):
    try:
        data_2 = df_carbon[(df_carbon['country'] == i[1]) 
                  & (df_carbon['sector'] == i[2]) 
                  & (df_carbon['year'] == i[0]) 
                  & (df_carbon['month'] == i[3]) 
                  & (df_carbon['week'] == i[4]) 
                  & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        data_1 = df_carbon[(df_carbon['country'] == i[1]) 
                  & (df_carbon['sector'] == i[2]) 
                  & (df_carbon['year'] == 2019) 
                  & (df_carbon['month'] == i[3]) 
                  & (df_carbon['week'] == i[4]) 
                  & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        co2.append(data_2-data_1)
        country.append(i[1])
        sector.append(i[2])
        year.append(i[0])
        month.append(i[3])
        week.append(i[4])
        weekday.append(i[5])
    except:
        pass

I changed the for loops to itertools, but it still not fast enough, any other ideas? many thanks:)

############################## here is the sample dataset


    country  co2  sector  date  week  weekday  year  month
    Brazil  108.767782  Power   2019-01-01  0   1   2019    1
    China   14251.044482    Power   2019-01-01  0   1   2019    1
    EU27 & UK   1886.493814  Power  2019-01-01  0   1   2019    1
    France  53.856398   Power   2019-01-01  0   1   2019    1
    Germany  378.323440  Power   2019-01-01  0   1   2019    1
    Japan   21.898788   IA  2021-11-30  48  1   2021    11
    Russia  19.773822   IA  2021-11-30  48  1   2021    11
    Spain   42.293944   IA  2021-11-30  48  1   2021    11
    UK  56.425121   IA  2021-11-30  48  1   2021    11
    US  166.425000  IA  2021-11-30  48  1   2021    11

or this

import pandas as pd
pd.DataFrame({
    'year': [2019, 2020, 2021],
    'co2': [1,2,3],
    'country': ['Brazil', 'Brazil', 'Brazil'],
    'sector': ['power', 'power', 'power'],
    'month': [1, 1, 1],
    'week': [0,0,0],
    'weekday': [0,0,0]
})

Please provide a sample input as text or DataFrame constructor and the matching expected output. — mozway, Feb 08 '22 at 03:41
I think you missed the "**as text or DataFrame constructor**", we can't import data from images ;) — mozway, Feb 08 '22 at 04:00
@mozway thx:P was wondering if this is what you wanted?I'm new to stackflow, the dataframe constructor looks great when i edit it, dont know why it changed to above mess:( or i was also wondering is there anyway i could provide the dataset? — Kow Song, Feb 08 '22 at 04:24
Thanks it's better ;) I fixed the layout. Btw, this is not a DataFrame constructor but a text table. It's still more complicated to use than a constructor, but definitely better than images. Read [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for more details ;) — mozway, Feb 08 '22 at 04:28
@mozway thank you! i read ur links and update my question, i used the pd.read_clipboard(sep='\s\s+') to active it:P looks great now? — Kow Song, Feb 08 '22 at 04:58

Raymond Kwok · Accepted Answer · 2022-02-08T05:28:05.490

0

pandas can subtract two dataframe index-by-index, so the idea would be to separate your data into a minuend and a subtrahend, set ['country', 'sector', 'month', 'week', 'weekday'] as their indices, just subtract them, and remove rows (by dropna) where a match in year 2019 is not found.

df_carbon = pd.DataFrame({
    'year': [2019, 2020, 2021],
    'co2': [1,2,3],
    'country': ['ab', 'ab', 'bc']
})

index = ['country']
# index = ['country', 'sector', 'month', 'week', 'weekday']

df_2019 = df_carbon[df_carbon['year']==2019].set_index(index)
df_rest = df_carbon[df_carbon['year']!=2019].set_index(index)

ans = (df_rest - df_2019).reset_index().dropna()
ans['year'] += 2019

Two additional points:

In this subtraction the year is also covered, so I need to add 2019 back.
I created a small example of df_carbon to test my code. If you had provided a more realistic version in text form, I would have tested my code using your data.

edited Feb 08 '22 at 05:28

answered Feb 08 '22 at 04:24

Raymond Kwok

2,461
2
9
11

thx for your help:) I just finished edit my question, was wondering if there is a way that i could provide my whole dataset? the one right now looks great when i edit it:( – Kow Song Feb 08 '22 at 04:28
is your data in csv format? if so, for me, it would be sufficient if you open the csv in a text editor, copy the first line (for the header) plus perhaps 10 example lines which can represent your problem. Then paste those 10+1 lines in your question. – Raymond Kwok Feb 08 '22 at 04:31
However, my answer should work if you uncomment `index = ['country', 'sector', 'month', 'week', 'weekday']`, try it with your data :) – Raymond Kwok Feb 08 '22 at 04:32
Thank you, I tested it with your code, and its correct, it tooks hours from using my code:( I will study it, thank you! – Kow Song Feb 08 '22 at 05:03
You are welcome! I and I think the others would be happy to answer your further question, but paste text data next time ;) – Raymond Kwok Feb 08 '22 at 05:13
Sure! I will:) btw ur code only run 1 sec, when mine run like 5-6hours! ur so good! – Kow Song Feb 08 '22 at 05:16
was wondering above edited pasteable dataset is good enought? or i should create a sample like urs?df_carbon = pd.DataFrame({ 'year': [2019, 2020, 2021], 'co2': [1,2,3], 'country': ['ab', 'ab', 'bc'] }) – Kow Song Feb 08 '22 at 05:22
haha, `pd.DataFrame(...)` takes me 1 second to start testing my code with your data, whereas the data table above would take me 1 minute - because I will write some additional code to convert the table. Either way is fine for me but I think everyone will love to see you do the `pd.DataFrame` thing. – Raymond Kwok Feb 08 '22 at 05:25
Try the data table way first if no one complains ;) it's easier. – Raymond Kwok Feb 08 '22 at 05:29
ok, will do:) thanks – Kow Song Feb 08 '22 at 05:56

Python/increase code efficiency about multiple columns filter

1 Answers1