
I have a dataset containing URLs with a publish date (YYYY-MM-DD) and visit counts. I want to calculate a benchmark (average) of visits for a complete year. The pages were published on different dates, so their visits should be weighted differently: e.g. a page published in August with 10,000 visits should contribute more than a page published in March with 11,000 visits, because it accumulated its visits in less time.

Here is my dataset:

Click here to see my dataset

First step:

First, I want to add a column (i.e. a time frame) to my dataset that is calculated from the publish date. For example, if the page was published on 2019-12-10, it should give the duration up to today's date. Expected output: (Dec 2019, 9 Months), i.e. (month and year in which the page was published, total months up to today).

Second step:

I want to normalize/rescale my data (visits) based on the time frame column calculated in step 1.

How can I calculate the average/benchmark?

ashish1780
  • In the second step, you want to have a table that shows the average of visits in the year? – Maryam Sep 23 '20 at 14:05
  • Yes, I want to calculate the average of the values based on the months you calculated in step 1. I'm also getting an error in step 1 while running the code: `File "", line 14, in normalize_date date_obj = datetime.strptime(date,"%Y-%m-%d %H:%M:%S") # get datetime object TypeError: strptime() argument 1 must be str, not numpy.datetime64` – ashish1780 Sep 25 '20 at 17:11
  • So there may be multiple visit records per month? – Maryam Sep 26 '20 at 09:03
  • I modified the answer to support the average! – Maryam Sep 26 '20 at 09:11
  • I'm getting this error message: `File "", line 3, in normalize_date date_obj = datetime.strptime(date,"%Y-%m-%d %H:%M:%S") # get datetime object TypeError: strptime() argument 1 must be str, not numpy.datetime64` – ashish1780 Sep 28 '20 at 04:52

1 Answer

For the first step you can use the following code. First, read the dataframe:

import pandas as pd
df = pd.read_csv("your_df.csv")

My example dataframe is as below:

            Pub.Dates Type  Visits
0  2019-12-10 00:00:00    A    1000
1  2019-12-15 00:00:00    A    5000
2  2018-06-10 00:00:00    B    6000
3  2018-03-04 00:00:00    B   12000
4  2019-02-10 00:00:00    A    3000

To normalize the dates, first define a function that normalizes a single date:

import pandas as pd

def normalize_date(date):  # input: '2019-12-10 00:00:00' (str, Timestamp or numpy.datetime64)
    # pd.to_datetime accepts strings as well as datetime64 values, which avoids
    # the "strptime() argument 1 must be str, not numpy.datetime64" TypeError
    date_obj = pd.to_datetime(date)
    date_to_str = date_obj.strftime("%B %Y")  # 'December 2019'
    diff_date = pd.Timestamp.now() - date_obj  # difference from today
    diff_month = int(diff_date.days / 30)  # approximate months, assuming 30-day months
    normalized_value = date_to_str + ", " + str(diff_month) + " months"
    return normalized_value  # e.g. 'December 2019, 9 months'

Now apply the above function to all values of the date column:

df["Pub.Dates"] = df["Pub.Dates"].apply(normalize_date)

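The normalized dataframe will look like this (the month counts depend on the date the code is run; row 5 is an additional record not shown in the example above):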

                  Pub.Dates Type  Visits
0   December 2019, 9 months    A    1000
1   December 2019, 9 months    A    5000
2      June 2018, 27 months    B    6000
3     March 2018, 31 months    B   12000
4  February 2019, 19 months    A    3000
5       July 2020, 2 months    C    9000
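
As a possible alternative to the string-based approach above, here is a minimal sketch that keeps the publish month and the month count in two separate columns and counts whole calendar months instead of dividing days by 30. The helper name `add_time_frame`, the column names `Published` and `Months`, and the fixed reference date are assumptions made for illustration; they are not part of the original code.

```python
import pandas as pd

# Hypothetical sample mirroring a few rows of the example dataframe
df = pd.DataFrame({
    "Pub.Dates": ["2019-12-10 00:00:00", "2018-06-10 00:00:00", "2019-02-10 00:00:00"],
    "Type": ["A", "B", "A"],
    "Visits": [1000, 6000, 3000],
})

def add_time_frame(frame, today):
    """Return a copy with 'Published' and 'Months' columns added (illustrative helper)."""
    out = frame.copy()
    # to_datetime accepts strings as well as datetime64 values
    pub = pd.to_datetime(out["Pub.Dates"])
    out["Published"] = pub.dt.strftime("%B %Y")  # e.g. 'December 2019'
    # Whole calendar months between the publish month and the reference month
    out["Months"] = (today.year - pub.dt.year) * 12 + (today.month - pub.dt.month)
    return out

# A fixed reference date keeps the result reproducible;
# pd.Timestamp.now() could be used instead for "today"
today = pd.Timestamp("2020-09-26")
result = add_time_frame(df, today)
print(result)
```

Keeping `Months` numeric means it can be used directly in later calculations, which the combined string such as 'December 2019, 9 months' does not allow.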

For the second step, if there are multiple records per month, group by the date and any other columns you need, then take the mean:

# pass the keys as a list; a tuple is treated as a single key in recent pandas versions
average_in_visits = df.groupby(["Pub.Dates", "Type"]).mean()

The result will be:

                               Visits
Pub.Dates                Type        
December 2019, 9 months  A       3000
February 2019, 19 months A       3000
July 2020, 2 months      C       9000
June 2018, 27 months     B       6000
March 2018, 31 months    B      12000
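
The grouped mean above does not yet account for how long each page has been online. One possible reading of "normalize by time frame" is to divide each page's visits by its months online and average those rates. The sketch below assumes a numeric `Months` column (months since publication) and assumes that visits accrue roughly evenly over time; the column name `VisitsPerMonth` and the benchmark definition are illustrative assumptions, not something the question specifies.

```python
import pandas as pd

# Hypothetical data: visits plus months since publication (from step 1)
df = pd.DataFrame({
    "Type": ["A", "A", "B", "B"],
    "Visits": [1000, 5000, 6000, 12000],
    "Months": [9, 9, 27, 31],
})

# Treat pages published in the current month as one month old
# to avoid dividing by zero
months_live = df["Months"].clip(lower=1)

# Visits per month online, so pages of different ages become comparable
df["VisitsPerMonth"] = df["Visits"] / months_live

# One possible benchmark: the mean monthly rate, scaled to a 12-month year
monthly_benchmark = df["VisitsPerMonth"].mean()
yearly_benchmark = monthly_benchmark * 12

print(df)
print(round(yearly_benchmark, 2))
```

If pages get most of their traffic shortly after publication, a plain per-month rate will understate new pages relative to old ones, so a different weighting may fit the data better.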
Maryam
  • Thanks Maryam for helping me with the first step. I'm getting this error while running the same code, please help: `date_obj = datetime.strptime(date,"%Y-%m-%d %H:%M:%S") # get datetime object TypeError: strptime() argument 1 must be str, not numpy.datetime64`. For the second step, what more information do you need? – ashish1780 Sep 25 '20 at 13:50
  • You should pass each element of the date column to the `normalize_date` method, because its input is just a string. Please run this part of the code: `df['Pub.Dates'] =list(map(lambda x: normalize_date(x), df["Pub.Dates"].values))`. It runs the `normalize_date` method for every element and returns the list of new dates. – Maryam Sep 26 '20 at 04:59