1

For example I want to replace 2/8/2014 0:00 with 2014 and 1/29/2015 0:00 with 2015 and henceforth.

2014               180657
2015               153837
2014                72395
2012                69708
2013                61364
2015                54117
2013                 3313
2012                 1076
2/8/2014 0:00           2
7/3/2014 0:00           2
1/29/2015 0:00          2
9/1/2014 0:00           2
11/22/2014 0:00         2
10/16/2014 0:00         2
Shashank Singh
  • 647
  • 1
  • 5
  • 22
Akshit
  • 123
  • 1
  • 8
  • 1
    Why don't you just convert to a datetime and take the year from it? It would be helpful if you could show what you have tried so that we can explain why that failed as this is not a complicated problem to solve. – roganjosh May 26 '18 at 09:58
  • Ok, the edit has thoroughly confused me. What are we looking at here? Is this a single dataframe? – roganjosh May 26 '18 at 10:02
  • Yes it is a single data frame where I have the value counts for the each year. Most of the "year" entries are not in proper format and I have to replace those into only the year. Btw I am new to all this so really appreciate your help. – Akshit May 26 '18 at 10:10
  • Right, so a single column can either contain a year or a datetime? That really does complicate things quite a bit :) Are they the only two possible "things" in that column? – roganjosh May 26 '18 at 10:12

2 Answers2

4

Starting with the Series, ser:

2014               180657
2015               153837
2014                72395
2012                69708
2013                61364
2015                54117
2013                 3313
2012                 1076
2/8/2014 0:00           2
7/3/2014 0:00           2
1/29/2015 0:00          2
9/1/2014 0:00           2
11/22/2014 0:00         2
10/16/2014 0:00         2
dtype: int64

You can convert the index to datetime and extract the year:

ser.index = pd.to_datetime(ser.index, errors='coerce').year
ser

2014    180657
2015    153837
2014     72395
2012     69708
2013     61364
2015     54117
2013      3313
2012      1076
2014         2
2014         2
2015         2
2014         2
2014         2
2014         2
dtype: int64

If this introduces NaNs, you can eliminate them by

ser = ser[ser.index.notnull()]
ser.index = ser.index.astype('int')

And if you want to group this by year, you can group on index:

ser.groupby(level=0).sum()
Out: 
2012     70784
2013     64677
2014    253062
2015    207956
dtype: int64
ayhan
  • 70,170
  • 20
  • 182
  • 203
  • Huh, I really didn't think `to_datetime` was so tolerant for such a mixed format in a single column. – roganjosh May 26 '18 at 10:13
  • @roganjosh Yeah I also get surprised by pandas' capabilities with date time data all the time. :) – ayhan May 26 '18 at 10:15
  • 1
    As do I, but usually in the sense of [this](https://stackoverflow.com/a/21916253/4799172) where Wes, in the same question, starts his answer with "welcome to hell" :P – roganjosh May 26 '18 at 10:20
  • I am getting this error when I try the code: TypeError: invalid string coercion to datetime – Akshit May 26 '18 at 10:31
  • 1
    Note this solution assumes the integer indices are strings, you can accommodate this trivially by `s.index.astype(str)`. – jpp May 26 '18 at 10:35
  • @jpp how could they not be strings? A column must have a single dtype, no? So it couldn't support the string dates without converting the whole column to a string type. – roganjosh May 26 '18 at 10:40
  • @roganjosh, Extreme case: `s = pd.Series(data=[180657, 153837, 2, 2], index=[2014, 2015, '2/8/2014 0:00', '7/3/2014 0:00'])` – jpp May 26 '18 at 10:40
  • @Akshit You have some values that pandas couldn't convert to date. You can identify them by `ser[pd.to_datetime(ser.index, errors='coerce').isnull()]`. Can you post the output of that? – ayhan May 26 '18 at 10:54
  • 28:48.3 1 44:42.2 1 37:26.1 1 53:04.4 1 35:25.2 1 56:26.8 1 37:50.0 1 31:13.0 1 31:29.9 1 34:49.3 1 26:59.5 1 48:02.3 1 42:04.8 1 44:20.8 1 37:38.7 1 56:37.9 1 40:48.2 1 39:47.7 1 Name: year, dtype: int64 – Akshit May 26 '18 at 11:02
  • I am unable to post this in the tabular format in the comments section. Anyways, this is what I got when I used ser[pd.to_datetime(ser.index, errors='coerce').isnull()] – Akshit May 26 '18 at 11:05
  • How do you want to handle them? For example `28:48.3` doesn't seem like date type data. Do you want to drop them or is there a specific year that these correspond to? – ayhan May 26 '18 at 11:07
  • Don't have the year info on these so I would like to drop them – Akshit May 26 '18 at 11:11
0

Try this:

s = pd.Series(['2017','2/3/2018 6:45'])
s = s.apply(lambda x: x.split()[0][-4:])
print(s)

Output:

0    2017
1    2018
dtype: object

This is just with a dummy series

just use the apply function for series then add a lambda into the parameters then just index it

U13-Forward
  • 69,221
  • 14
  • 89
  • 114
  • This is easy when the entries are less. How do I do this when I have 5,00,000 entries. Apologies if these questions seem a bit silly, I am new to pandas. – Akshit May 26 '18 at 10:53
  • @Akshit I just tried my solution with 500,000 entries, it still works – U13-Forward May 26 '18 at 10:58