0

I need to create a new column in the dataset that contains only the year. The dpro columns contains more text example: 1913/12/30 : classé MH. I´ve tried with other arguments but Something is missing and I am junior in python. Thanks

Code:

monuments["year_protec"] = pd.to_datetime(monuments["dpro"], format ="%Y",errors ="coerce")
monuments.head()
Jon Clements
  • 138,671
  • 33
  • 247
  • 280
adrimon07
  • 13
  • 3
  • 1
    Could you provide a sample of your `dpro` column? – Jon Clements Apr 09 '22 at 17:12
  • Welcome to Stack Overflow! For help with non-working code, you need to make a [mre] including example input, expected output, and actual output -- or if you get an error, the [full error message with traceback](https://meta.stackoverflow.com/q/359146/4518341). See [How to make good reproducible pandas examples](/q/20109391/4518341) for specifics. You can [edit]. For more tips, see [ask]. – wjandrea Apr 09 '22 at 17:34
  • Hello Jon, example, dpro: 1913/12/30 : classé MH, 1932/12/29 : classé MH, 1948/04/13 : inscrit MH...but a few rows have corrupted data (like references to the year 20115) so I used in the function errors="coerce". – adrimon07 Apr 10 '22 at 02:09

1 Answers1

0

Maybe you can try to clean up the string first, and convert it into datetime format and finally get the year part.

import pandas as pd
import re

s = ["1913/12/30 : classé MH", "1913/12/30 : classé MH","1913/12/30 : classé MH"]
df = pd.DataFrame({"date" : s})

#df 

    date
0   1913/12/30 : classé MH
1   1913/12/30 : classé MH
2   1913/12/30 : classé MH
drop = re.compile(r'[^(\d{4}\/\d{2}\/\d{2})]')
df["clean_date"] = df["date"].str.replace(drop, "")
df["year"] = pd.to_datetime(df["clean_date"], format = "%Y/%m/%d").dt.year
# df
    date                    clean_date  year
0   1913/12/30 : classé MH  1913/12/30  1913
1   1913/12/30 : classé MH  1913/12/30  1913
2   1913/12/30 : classé MH  1913/12/30  1913

Looks into re.compile, \d{4}\/\d{2}\/\d{2} is used to compile a regular expression pattern, which tells python to find the string where match a rule with :

\d{4} : digit appears four times.

\/ : a slash.

\d{2} : digit appears two times.

And we can see that \d{4}\/\d{2}\/\d{2} construct a date-like string, for example, 2022/04/10.

Moreover, the ^ before the given pattern \d{4}\/\d{2}\/\d{2} means that I want to exclude the pattern.

So, what I did later is that I replace the string with the part is not a date-like pattern with empty string.

Which means :

We can separate 1913/12/30 : classé MH into two parts:

  1. 1913/12/30: date-like part.
  2. : classé MH : not a date-like part.
df["date"].str.replace(drop, "")

The code will choose the not a date-like part and replace it with "".

For more information about regular expression, please check https://docs.python.org/3/library/re.html .

Denny Chen
  • 489
  • 3
  • 8