0

I am trying to complete the below question and am running into some creating this function and assigning string values to the various dates. How would I go about creating this function to return the various string values.

nickp
  • 43
  • 6

1 Answers1

2

Here is my suggestion :

EDIT : using this question for the regex to check time validity

import pandas as pd
import numpy as np
import re 


def compute_time_day_year(data_dict):
    """

    returns: pandas DataFrame with variables weekday (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday),
    time_of_day (Morning - 06:00-11:59, Afternoon - 12:00-17:59, Evening - 18:00-23:59, Night - 00:00 - 05:59),
    and season (Summer for June, July and August; Autumn for September, October, November; Winter for
    December, January, February; Spring for March, April, May). If either of the input parameters is in incorrect
    form, the function returns INVALID in all outputs
    """

    df = pd.DataFrame({'year': (data_dict['Year']),
                       'month': (data_dict['Crash_Month']),
                       'day': (data_dict['Crash_Day']),
                       'time': (data_dict['Crash_Time'])})
    df["invalid_time"] = df["time"].apply(lambda x: not re.match("^([0-1]?[0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$", x))

    df.loc[~df["invalid_time"], 'date'] = pd.to_datetime(df.loc[~df["invalid_time"],'year'].astype(str)
                               + df.loc[~df["invalid_time"],'month'].astype(str).str.zfill(2)
                               + df.loc[~df["invalid_time"],'day'].astype(str).str.zfill(2)
                               + ' '
                               + df.loc[~df["invalid_time"],'time'].astype(str), format='%Y%m%d %H:%M:%S')

    df.loc[~df["invalid_time"],'weekday'] = df.loc[~df["invalid_time"],"date"].dt.day_name(locale="english")
    df.loc[~df["invalid_time"],'season'] = (df.loc[~df["invalid_time"],'date'] - pd.DateOffset(months=1))\
        .dt.quarter
    df.loc[~df["invalid_time"],'season'] = df.loc[~df["invalid_time"],'season']\
        .map({1: 'Winter', 2: 'Spring', 3: 'Summer', 4: 'Autumn'})
    df.loc[~df["invalid_time"],"time_of_day"] = df.loc[~df["invalid_time"],"date"].dt.hour\
        .apply(lambda x: np.select(
            [0 < x <= 6,
             6 < x <= 12,
             12 < x <= 18,
             18 < x <= 24],
            ["Night", "Morning", "Afternoon", "Evening"]
        ))

    df.loc[df["invalid_time"],["weekday", "season", "time_of_day"]] = "Invalid"

    return df.loc[:, ["weekday", "season", "time_of_day"]]

data_dict = {'Year': [2018, 2019, 2020],
             'Crash_Month': [1, 2, 3],
             'Crash_Day': [4, 5, 6],
             'Crash_Time': ["8:00:00", '26:22:00', '8:12:00']}
compute_time_day_year(data_dict)

And it returns for this example:

   weekday   season time_of_day
0  Thursday   Autumn     Morning
1   Invalid  Invalid     Invalid
2    Friday   Winter     Morning

Hope it helps.

Raphaele Adjerad
  • 1,117
  • 6
  • 12
  • thanks a lot for that. I actually imported the data from a CSV, hence why I reference fields as data_dict[...]. This coding doesn't seem to work for the data_dict CSV, do you have any other ideas as to how to code it – nickp Apr 22 '20 at 06:32
  • @nickp I changed my answer to match your question. – Raphaele Adjerad Apr 22 '20 at 07:24
  • Thanks again. Just to clarify, I want an invalid output to come up if one of the values in Crash_Time is not a time. How would I code this? I have tried if type != time, which does not work – nickp Apr 22 '20 at 08:49
  • thanks so much! just one more thing, I only want the invalid output to come up for those entries that are invalid, and not the whole data frame being invalid. Is there a way to do this? At the moment, if I add in the If statement, it only returns 1 row saying INVALID, however, I would still like to be able to see all the other rows if they are not INVALID. Also, I would like to print the first 10 records from the data frame, I assume there is a way to do that? – nickp Apr 23 '20 at 06:09
  • @nickp I edited the answer. Now you have Invalid only for the lines where time is Invalid. – Raphaele Adjerad Apr 23 '20 at 06:29
  • that's great, thanks so much. If I wanted to see only the valid rows, what would I change? How about only invalid? – nickp Apr 23 '20 at 07:08
  • 1
    You can filter the resulting df based on values for instance if `df` is the output of my example, so see only valid use `df.loc[df["time_of_day"] != "invalid", :]` and to see only invalid `df.loc[df["time_of_day"] == "invalid", :]`. If the answer suits you could you mark it as [accepted](https://stackoverflow.com/help/someone-answers)? There is no obligation to do so :). – Raphaele Adjerad Apr 23 '20 at 07:22