Consider the following file test.csv:

"Time","RegionCode","RegionName","NumValue"
"2009-W40","AT","Austria",0
"2009-W40","BE","Belgium",54
"2009-W40","BG","Bulgaria",0
"2009-W40","CZ","Czech Republic",1

I'd like to parse the dates stored in the first column while creating a DataFrame, like so:

parser = lambda x: pd.datetime.strptime(x, "%Y-W%W")
df = pd.read_csv("test.csv", parse_dates=["Time"], date_parser=parser)

Result:

    Time    RegionCode  RegionName  NumValue
0   2009-01-01  AT  Austria 0
1   2009-01-01  BE  Belgium 54
2   2009-01-01  BG  Bulgaria    0
3   2009-01-01  CZ  Czech Republic  1

However, the resulting Time column is not correct. All I get is "2009-01-01", which is certainly not in the 40th week of the year. Am I doing something wrong? Has anybody else had this issue when parsing weeks?

r0f1
  • Possible duplicate; see https://stackoverflow.com/questions/17087314/get-date-from-week-number – moys Oct 05 '19 at 11:00
  • Thanks, changing the line to `parser = lambda x: pd.datetime.strptime(x+"-1", "%Y-W%W-%w")` solved my issue! – r0f1 Oct 05 '19 at 11:05

2 Answers

You are almost correct. The only problem is that a year and a week number alone do not determine a specific date. The trick is to append a day of the week, for example 1 (Monday).
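To see why the original parser returns January 1st, here is a minimal sketch using only the standard library (no pandas, so it does not depend on pd.datetime). Python's strptime accepts %W without a weekday but only uses it to compute the date when a day of the week and the year are also given. The variable names are just for illustration.

```python
from datetime import datetime

# Without a weekday, %W is parsed but not used to compute the date,
# so the result falls back to January 1st of the given year.
without_weekday = datetime.strptime("2009-W40", "%Y-W%W")

# Appending "-1" (Monday, read via %w) lets strptime resolve the week.
with_weekday = datetime.strptime("2009-W40" + "-1", "%Y-W%W-%w")

print(without_weekday.date())  # 2009-01-01
print(with_weekday.date())     # 2009-10-05
```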

I would recommend using pd.to_datetime() with an explicit format string. That works fine with the added 1:

pd.to_datetime(df['Time'] + '-1', format='%Y-W%W-%w')
# 0   2009-10-05
# 1   2009-10-05
# 2   2009-10-05
# 3   2009-10-05
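One caveat: a label like "2009-W40" looks like ISO 8601 week notation, and %W does not follow ISO week numbering (with %W, week 1 starts on the first Monday of the year). If the data really uses ISO weeks, which is an assumption about the source and not something the question states, the ISO directives %G, %V and %u give a different Monday. A sketch with the standard library:

```python
from datetime import date, datetime

# Assumption: "2009-W40" is an ISO 8601 week label.
# %G = ISO year, %V = ISO week number, %u = ISO weekday (1 = Monday).
iso_monday = datetime.strptime("2009-W40" + "-1", "%G-W%V-%u").date()

# The same date built directly from ISO calendar fields (Python 3.8+).
same_via_calendar = date.fromisocalendar(2009, 40, 1)

print(iso_monday)         # 2009-09-28
print(same_via_calendar)  # 2009-09-28
```

For 2009 this is one week earlier than the %W result, because ISO week 1 of 2009 already begins on 2008-12-29.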
KenHBS

I am not sure if you can parse it directly in read_csv, but you can certainly do it afterwards:

import pandas as pd

test = [
["2009-W40","AT","Austria",0],
["2009-W40","BE","Belgium",54],
["2009-W40","BG","Bulgaria",0],
["2009-W40","CZ","Czech Republic",1]]

df = pd.DataFrame(test,columns=["Time","RegionCode","RegionName","NumValue"])

df["Time"] = pd.to_datetime(df["Time"].str[:4],format="%Y") + \
             pd.to_timedelta(df["Time"].str[-2:].astype(int).mul(7),unit="days")

print (df)

#
        Time RegionCode      RegionName  NumValue
0 2009-10-08         AT         Austria         0
1 2009-10-08         BE         Belgium        54
2 2009-10-08         BG        Bulgaria         0
3 2009-10-08         CZ  Czech Republic         1
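Note that this approach simply shifts January 1st by 7 days per week number, so the result keeps the weekday of January 1st instead of landing on a Monday. A small standard-library sketch of the same arithmetic (the names week and shifted are illustrative):

```python
from datetime import date, timedelta

week = 40

# Same arithmetic as above: start of the year plus 7 days per week number.
shifted = date(2009, 1, 1) + timedelta(days=7 * week)

print(shifted)            # 2009-10-08
print(shifted.weekday())  # 3, i.e. Thursday (Monday is 0)
```

Whether that is acceptable depends on which week convention the data follows.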
Henry Yik