Maybe you can try to clean up the string first, and convert it into datetime
format and finally get the year part.
import pandas as pd
import re
s = ["1913/12/30 : classé MH", "1913/12/30 : classé MH","1913/12/30 : classé MH"]
df = pd.DataFrame({"date" : s})
#df
date
0 1913/12/30 : classé MH
1 1913/12/30 : classé MH
2 1913/12/30 : classé MH
drop = re.compile(r'[^(\d{4}\/\d{2}\/\d{2})]')
df["clean_date"] = df["date"].str.replace(drop, "")
df["year"] = pd.to_datetime(df["clean_date"], format = "%Y/%m/%d").dt.year
# df
date clean_date year
0 1913/12/30 : classé MH 1913/12/30 1913
1 1913/12/30 : classé MH 1913/12/30 1913
2 1913/12/30 : classé MH 1913/12/30 1913
Looks into re.compile
, \d{4}\/\d{2}\/\d{2}
is used to compile a regular expression pattern, which tells python to find the string where match a rule with :
\d{4} : digit appears four times.
\/ : a slash.
\d{2} : digit appears two times.
And we can see that \d{4}\/\d{2}\/\d{2}
construct a date-like
string, for example, 2022/04/10
.
Moreover, the ^
before the given pattern \d{4}\/\d{2}\/\d{2}
means that I want to exclude the pattern.
So, what I did later is that I replace the string with the part is not a date-like pattern with empty string.
Which means :
We can separate 1913/12/30 : classé MH
into two parts:
1913/12/30
: date-like part.
: classé MH
: not a date-like part.
df["date"].str.replace(drop, "")
The code will choose the not a date-like part
and replace it with ""
.
For more information about regular expression, please check https://docs.python.org/3/library/re.html .