I have a wide-format data frame whose column names are either date ranges or empty strings (read in as `Unnamed N`), while the first row holds the intended column headers. I need code that deduces the week number from each date-range header, then takes the column name from the first row and renames the column accordingly (i.e. week1_quantity, week1_sales, week1_profit).

import pandas as pd
df = pd.DataFrame([
    {'Related Fields':'Description', 'Unnamed 1':'barcode',
        'Unnamed 2':'department', 'Unnamed 3':'section',
        'Unnamed 4':'reference', 'Sales: (06/07/2020,12/07/2020)':'Quantity',
        'Unnamed 6':'amount', 'Unnamed 7':'cost',
        'Unnamed 8':'% M/S', 'Unnamed 9': 'profit',
        'Sales: (29/06/2020,05/07/2020)': 'Quantity',
        'Unnamed 11':'amount', 'Unnamed 12':'cost',
        'Unnamed 13':'% M/S', 'Unnamed 14':'profit'},
    {'Related Fields':'cornflakes', 'Unnamed 1':'0001198',
        'Unnamed 2':'grocery', 'Unnamed 3':'breakefast',
        'Unnamed 4': '0001198', 'Sales: (06/07/2020,12/07/2020)': 60,
        'Unnamed 6': 6000, 'Unnamed 7':3000, 'Unnamed 8':50,
        'Unnamed 9':3000, 'Sales: (29/06/2020,05/07/2020)': 120,
        'Unnamed 11':12000, 'Unnamed 12':6000, 'Unnamed 13':50,
        'Unnamed 14':6000}
])

Expected result

df2 = pd.DataFrame([
    {'Description':'cornflakes', 'barcode':'0001198',
        'department':'grocery', 'section':'breakefast',
        'reference':'0001198', 'week28_quantity':60,
        'week28_amount':6000, 'week28_cost':3000,
        'week28_% M/S':50, 'week28_profit':3000,
        'week27_quantity':120, 'week27_amount':12000,
        'week27_cost':6000, 'week27_% M/S':50,
        'week27_profit':6000}
])

I've tried to change the name manually but would like an automated solution.

RichieV
Rich

1 Answer

You can solve this by parsing the start date of each range with `datetime.strptime` and using `datetime.isocalendar` to get the ISO week number.

from datetime import datetime

# get week numbers
wknums = [
    'week' + str(
        datetime.strptime(colname.split()[1][1:11], '%d/%m/%Y')
        .isocalendar()[1]
    ) + '_'
    if colname.startswith('Sales')
    else None
    for colname in df.columns
]

wknums = (
    pd.Series(wknums).ffill().fillna('') # forward fill week numbers
    + df.loc[0].to_numpy() # add text from first row
).str.lower() # change to lower case, use it only if it helps


df.columns = wknums # replace df column labels
df = df.iloc[1:].reset_index(drop=True) # drop first row

Output

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   description      1 non-null      object
 1   barcode          1 non-null      object
 2   department       1 non-null      object
 3   section          1 non-null      object
 4   reference        1 non-null      object
 5   week28_quantity  1 non-null      object
 6   week28_amount    1 non-null      object
 7   week28_cost      1 non-null      object
 8   week28_% m/s     1 non-null      object
 9   week28_profit    1 non-null      object
 10  week27_quantity  1 non-null      object
 11  week27_amount    1 non-null      object
 12  week27_cost      1 non-null      object
 13  week27_% m/s     1 non-null      object
 14  week27_profit    1 non-null      object
dtypes: object(15)
memory usage: 248.0+ bytes
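
To see why the headers map to weeks 28 and 27, here is a minimal sketch of the week-number step applied to one header at a time. The helper name `week_prefix` is hypothetical (it is not part of the answer's code), and it assumes every date-range header has the exact form `Sales: (dd/mm/yyyy,dd/mm/yyyy)`.

```python
from datetime import datetime


def week_prefix(colname):
    """Return 'weekNN_' for a 'Sales: (start,end)' header, else None.

    Hypothetical helper for illustration; assumes the start date
    occupies characters 1-10 of the second whitespace-separated token.
    """
    if not colname.startswith('Sales'):
        return None
    start_text = colname.split()[1][1:11]          # e.g. '06/07/2020'
    start_date = datetime.strptime(start_text, '%d/%m/%Y')
    iso_week = start_date.isocalendar()[1]         # index 1 is the ISO week
    return f'week{iso_week}_'


print(week_prefix('Sales: (06/07/2020,12/07/2020)'))  # week28_
print(week_prefix('Sales: (29/06/2020,05/07/2020)'))  # week27_
print(week_prefix('Unnamed 6'))                       # None
```

ISO weeks start on Monday, so a range that begins on Monday 29/06/2020 falls in week 27, not week 29.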
RichieV
  • You're welcome, you can accept the answer with the tick mark below the voting arrows, your selection can be changed if a better answer comes along – RichieV Sep 22 '20 at 07:26
  • I would do that once I'm done with this, I have a different question this time, what if I wanted to create a new column which would be the row-based mean of each weekn_quantity excluding those with zero values and another column to show the mean of the top 5 values of each weekn_quantity row. – Rich Sep 22 '20 at 16:23
  • I don't understand the exact details of this new request, perhaps you need to post a new question for that... and I don't need you to accept this answer, I'm just pointing out how answers are normally handled according to [the help center](https://stackoverflow.com/help/someone-answers) – RichieV Sep 22 '20 at 16:49
  • If you post a new question make sure you state your requirements clearly and thoroughly, but also try to make it as focused and concise as possible... questions asking for multiple things are closed quickly – RichieV Sep 22 '20 at 16:52
  • Hello @RichieV, what if I wanted the columns to show the month and year, as in 01_2020_quantity, 01_2020_amount,..., as opposed to week1_quantity, week1_cost,...? How would I go about that? I already tried modifying your code, but isocalendar() only gives values for year, week and weekday. Kindly assist by adding the code for that to your answer, thank you. – Rich Oct 21 '20 at 07:58
  • You can use `series.dt.year` and `series.dt.month` then concatenate... Most answers (especially if you feel it should be quite simple) are already somewhere on the site or the internet. I suggest you always do a quick search for it. For example I searched *site:stackoverflow.com pandas get month and year* and [this answer](https://stackoverflow.com/a/25149272/6692898) was the first hit – RichieV Oct 21 '20 at 14:46
  • Thank you for the feedback, I did so and was able to figure it out before you responded. I am new to programming but I am trying my best to catch up. – Rich Oct 25 '20 at 21:06
  • It's all good with me, my comment was honestly meant to show how I would do that specific search, you'll get better at spotting the right search terms over time... don't hesitate to write again, best of luck – RichieV Oct 26 '20 at 00:10
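
For the month-and-year variant asked about in the comments, one possible sketch is shown below. It swaps in `datetime.strftime` rather than the `series.dt.year` / `series.dt.month` route suggested above, and the helper name `month_year_prefix` plus the shortened `columns` and `first_row` lists are assumptions made for illustration only.

```python
from datetime import datetime


def month_year_prefix(colname):
    """Return 'MM_YYYY_' for a 'Sales: (start,end)' header, else None.

    Hypothetical helper; assumes the same header layout as the question.
    """
    if not colname.startswith('Sales'):
        return None
    start_date = datetime.strptime(colname.split()[1][1:11], '%d/%m/%Y')
    return start_date.strftime('%m_%Y_')


# Shortened stand-ins for df.columns and df.loc[0] from the question
columns = ['Related Fields', 'Sales: (06/07/2020,12/07/2020)', 'Unnamed 6',
           'Sales: (29/06/2020,05/07/2020)', 'Unnamed 11']
first_row = ['Description', 'Quantity', 'amount', 'Quantity', 'amount']

new_names = []
prefix = ''  # columns before the first date range get no prefix
for colname, label in zip(columns, first_row):
    found = month_year_prefix(colname)
    if found is not None:
        prefix = found  # carry the prefix forward to the following columns
    new_names.append((prefix + label).lower())

print(new_names)
# ['description', '07_2020_quantity', '07_2020_amount',
#  '06_2020_quantity', '06_2020_amount']
```

Note that the month is taken from the start date of each range, so a week that spans two months (such as 29/06/2020 to 05/07/2020) is labelled with the earlier month.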