0

I'm trying to build a dataframe that will be used for linear regression. I would like to include 11 independent "dummy" variables that are set to either 1 or 0 based on the month of the year. Without getting too far off topic, I'm using 11 variables instead of 12, as the 12th month is captured by the intercept.

I know many things can be done with pandas without looping through the entire dataframe, and doing things in that manner are typically faster than using a loop.

So, is it possible to grab the month from my date column, and dynamically set a seperate column to either a 1 or a 0 based on that month? Or am I asking a stupid question?

Edit: I should have included more information. A dataframe is structured like this:

Date sku units ordered sessions conversion rate
2020/01/30 abc123 20 200 0.1
2020/01/31 abc123 10 100 0.1
2020/02/01 abc123 15 60 0.25

I would like to make it look like this:

Date sku units ordered sessions conversion rate january february
2020/01/30 abc123 20 200 0.1 1 0
2020/01/31 abc123 10 100 0.1 1 0
2020/02/01 abc123 15 60 0.25 0 1

The code I'm currently using to accomplish this is:

x = 1
while x < 12:
    month = calendar.month_name[x]
    df[month] = 0
    x += 1

for index, row in df.iterrows():
    d = row[0]
    month = d.strftime("%B")
    if not month == "December":
        df.at[index, month] = 1

    df.fillna(0, inplace=True)

Just not sure if this is the best way to accomplish this.

MarcBmann
  • 15
  • 3
  • 2
    "is it possible?" Yes, there are many ways. If you provide a minimal example of what your dataset looks like and what result set you're looking for, you're more likely to get a complete answer with code for your _specific_ case. [Pandas/Python: Set value of one column based on value in another column](https://stackoverflow.com/q/49161120/15497888) – Henry Ecker Apr 27 '21 at 19:46
  • Hey Henry, I appreciate your response. I went ahead and added more details to the OP. I think this clarifies the original question. – MarcBmann Apr 27 '21 at 20:31
  • Follow up question. Are there December values in your DF? You say month 12 is "captured by the intercept" and I assumed that meant that the DF did not contain values of that month, but your loop seems to filter out December. – Henry Ecker Apr 27 '21 at 21:06
  • 1
    I am only including months January through November. I'm feeding this dataframe into a linear regression formula, and the months are independent variables. The thing with regression is that if you include a variable for each of all 12 months, the regression will fail as it can't solve for the intercept. The 12th month in this case is accounted for when the variables for January through November are equal to zero. Edit: For clarity, I have rows that contain sales data in the month of december, but no independent variable for this month. – MarcBmann Apr 27 '21 at 21:09

1 Answers1

0

My approach would be to first get the month number from every month using dt.month:

df['Date'].dt.month
0    1
1    1
2    2
Name: Date, dtype: int64

Then use crosstab with the index to get the tabulation of the counts:

pd.crosstab(
        df.index,
        df['Date'].dt.month
    )
Date   1  2
row_0      
0      1  0
1      1  0
2      0  1

Then merge back to the DF on index:

df = (
    df.merge(pd.crosstab(
        df.index,
        df['Date'].dt.month
    ),
        left_index=True,
        right_index=True)
)

Output:

        Date     sku  units ordered  sessions  conversion rate  1  2
0 2020-01-30  abc123             20       200             0.10  1  0
1 2020-01-31  abc123             10       100             0.10  1  0
2 2020-02-01  abc123             15        60             0.25  0  1

Finally, rename the columns using a mapper generated with the calendar api:

df = df.rename(columns={month_num: calendar.month_name[month_num]
                        for month_num in range(1, 13)})

All together:

import pandas as pd
import calendar

df = pd.DataFrame(
    {'Date': {0: '2020/01/30', 1: '2020/01/31', 2: '2020/02/01'},
     'sku': {0: 'abc123', 1: 'abc123', 2: 'abc123'},
     'units ordered': {0: 20, 1: 10, 2: 15},
     'sessions': {0: 200, 1: 100, 2: 60},
     'conversion rate': {0: 0.1, 1: 0.1, 2: 0.25}})
df['Date'] = df['Date'].astype('datetime64[ns]')

df = (
    df.merge(pd.crosstab(
        df.index,
        df['Date'].dt.month
    ),
        left_index=True,
        right_index=True)
)

df = df.rename(columns={month_num: calendar.month_name[month_num]
                        for month_num in range(1, 13)})

print(df.to_string())

Output:

        Date     sku  units ordered  sessions  conversion rate  January  February
0 2020-01-30  abc123             20       200             0.10        1         0
1 2020-01-31  abc123             10       100             0.10        1         0
2 2020-02-01  abc123             15        60             0.25        0         1
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
  • This is awesome. I knew there had to be a way of doing this without a loop. This is some straight up pandas-fu. – MarcBmann Apr 27 '21 at 21:00