
I want to extract all dates in a specific format (e.g. January 1, 2020) into a list in Python. For example, my text is:

"Psg January 1, 2020 hsjkfsdlkfhshdfh January 2, 1908 hdhahhajshjdjoi December 31, 2019 fafsfafagherhea"

I then want to extract the dates into:

["January 1, 2020", "January 2, 1908", "December 31, 2019"]

I found the method str.split(), but had no success with it.

How can I do this?

Thank you for the help!

P.S.

Actually, I want to extract those dates and then convert them to the format:

"January 1, 2020" -> "1. January 2020"

and then put them back into the text.

In a nutshell: I want to replace dates in one format with dates in another format within the text.

Edit:

I have got it working. Thank you for your effort!

User123
  • You should do this using regex – Nathan Jan 05 '20 at 10:31
  • Possible duplicate of [Regex to match date like month name day comma and year](https://stackoverflow.com/questions/35413746/regex-to-match-date-like-month-name-day-comma-and-year/35413952) – Alexandre B. Jan 05 '20 at 10:41
  • @User123 Please have a look at [Why are some questions marked as duplicate?](https://stackoverflow.com/help/duplicates) – Alexandre B. Jan 05 '20 at 10:46

6 Answers

2

For this task it is better to use regular expressions (the re module in Python).

For example (Regex101 for explanation):

import re

txt = '''Psg January 1, 2020 hsjkfsdlkfhshdfh January 2, 1908 hdhahhajshjdjoi December 31, 2019 fafsfafagherhea'''

r = re.compile(r'(January|February|March|April|May|June|July|August|September|October|November|December)\s*(\d+),\s*(\d+)')

new_txt = r.sub(r'\2. \1 \3', txt)
print(new_txt)

Prints:

Psg 1. January 2020 hsjkfsdlkfhshdfh 2. January 1908 hdhahhajshjdjoi 31. December 2019 fafsfafagherhea
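
Since the Regex101 link is not available here, the following is a minimal annotated sketch of the same substitution. The group names (month, day, year), the word boundaries and the stricter digit counts are my own additions for illustration, not part of the answer above.

```python
import re

# Annotated variant of the pattern; group names are illustrative only.
DATE_RE = re.compile(
    r"""
    \b(?P<month>January|February|March|April|May|June|July|
        August|September|October|November|December)
    \s+(?P<day>\d{1,2}),      # one- or two-digit day followed by a comma
    \s+(?P<year>\d{4})\b      # four-digit year
    """,
    re.VERBOSE,               # ignore whitespace and allow comments in the pattern
)

txt = ("Psg January 1, 2020 hsjkfsdlkfhshdfh January 2, 1908 "
       "hdhahhajshjdjoi December 31, 2019 fafsfafagherhea")

# \g<name> refers back to a named group in the replacement string.
new_txt = DATE_RE.sub(r"\g<day>. \g<month> \g<year>", txt)
print(new_txt)
```

Named groups make the replacement string easier to read than \2. \1 \3, at the cost of a longer pattern.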
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
2

A regex like the following will help: '((?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d+)'

import re

message = "Psg January 1, 2020 hsjkfsdlkfhshdfh January 2, 1908 hdhahhajshjdjoi December 31, 2019 fafsfafagherhea"
matches = re.findall(
    r'((?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d+)',
    message)
for match in matches:
    print(match)

Then, to convert the date format, use strptime and strftime:

from datetime import datetime

input_format = "%B %d, %Y"  # full month name, day and year
for match in matches:
    parsed = datetime.strptime(match, input_format)
    # %d would zero-pad the day ("01"), so format the day as a plain integer
    new_date = f"{parsed.day}. {parsed:%B %Y}"
    print(match, ">>", new_date)
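
To put the converted dates back into the text, as the question asks, one option is to pass a function to re.sub. This is only a sketch under a few assumptions: the helper names (_reformat, convert_dates) are made up for illustration, %B relies on an English locale, and strings that look like dates but are not valid (e.g. February 31) are left untouched.

```python
import re
from datetime import datetime

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
PATTERN = re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}\b")

def _reformat(match):
    """Turn one matched 'January 1, 2020' into '1. January 2020'."""
    try:
        parsed = datetime.strptime(match.group(0), "%B %d, %Y")
    except ValueError:
        # Not a real calendar date: keep the original text unchanged.
        return match.group(0)
    return f"{parsed.day}. {parsed:%B} {parsed.year}"

def convert_dates(text):
    """Replace every matching date in text with the reformatted version."""
    return PATTERN.sub(_reformat, text)

message = ("Psg January 1, 2020 hsjkfsdlkfhshdfh January 2, 1908 "
           "hdhahhajshjdjoi December 31, 2019 fafsfafagherhea")
print(convert_dates(message))
```

Going through datetime is more work than a plain backreference substitution, but it validates each date along the way.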
azro
2

This has been asked a dozen times. Imo the best way is to use a library, e.g. datefinder:

import datefinder
text = "Psg January 1, 2020 hsjkfsdlkfhshdfh January 2, 1908 hdhahhajshjdjoi December 31, 2019 fafsfafagherhea"
matches = datefinder.find_dates(text)

for match in matches:
    print(match)

Which yields

2020-01-01 00:00:00
1908-01-02 00:00:00
2019-12-31 00:00:00
Jan
1

You can use the find() method to locate the index of each month name and then count characters from there to extract the date.

See: https://www.journaldev.com/23666/python-string-find
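
A rough sketch of how this find()-based idea could look is below. The function names are hypothetical, and it assumes exactly the "Month D, YYYY" layout with a single space after the comma, so a regex is likely more robust in practice.

```python
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def find_month_positions(text):
    """Return sorted (index, month) pairs for every month name found with str.find()."""
    hits = []
    for month in MONTHS:
        start = text.find(month)
        while start != -1:
            hits.append((start, month))
            start = text.find(month, start + 1)  # look for the next occurrence
    return sorted(hits)

def extract_dates(text):
    """Collect 'Month D, YYYY' strings by slicing around each month position."""
    dates = []
    for start, month in find_month_positions(text):
        comma = text.find(",", start)
        if comma == -1:
            continue
        day = text[start + len(month):comma].strip()
        year = text[comma + 1:comma + 6].strip()  # one space plus four digits
        if day.isdigit() and year.isdigit() and len(year) == 4:
            dates.append(f"{month} {day}, {year}")
    return dates

text = ("Psg January 1, 2020 hsjkfsdlkfhshdfh January 2, 1908 "
        "hdhahhajshjdjoi December 31, 2019 fafsfafagherhea")
print(extract_dates(text))
```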

DaxBrin
1
months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

date_info = "Psg January 1, 2020 hsjkfsdlkfhshdfh January 2, 1908 hdhahhajshjdjoi December 31, 2019 fafsfafagherhea"

for month in months:
    # Repeat until no whitespace-separated token equals this month name
    while month in date_info.split():
        words = date_info.split()
        i = words.index(month)
        day, year = words[i + 1], words[i + 2]  # e.g. "1," and "2020"
        self_str = month + " " + day + " " + year
        # The "~" marks the month as already handled so the loop terminates
        rep_str = day.strip(',') + ". " + month + "~ " + year
        date_info = date_info.replace(self_str, rep_str)
# Remove the temporary markers
date_info = date_info.replace("~", "")
print(date_info)
1

A pure regex solution would be to use the following regex to extract date strings in this format from the given text:

\w+\s+\d{1,2},\s+\d{4}

Regex explanation and demo can be found here.

Then pass this regex to re.findall, which returns all occurrences of the pattern.

import re
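# Avoid naming the variable "str", which would shadow the built-in type
text = "Psg January 1, 2020 hsjkfsdlkfhshdfh January 2, 1908 hdhahhajshjdjoi December 31, 2019 fafsfafagherhea"
# Raw string so that \w, \s and \d are passed to the regex engine unchanged
x = re.findall(r"\w+\s+\d{1,2},\s+\d{4}", text)
print(x)
Output: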

['January 1, 2020', 'January 2, 1908', 'December 31, 2019']
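
One caveat: \w+ matches any word, not only month names, so a string such as "Room 12, 2020" is picked up too. A possible way to filter the matches, sketched here with a hypothetical helper find_month_dates and a hard-coded set of English month names, is:

```python
import re

MONTHS = {"January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"}

# Same loose pattern as above, with groups added so the word can be checked.
LOOSE = re.compile(r"(\w+)\s+(\d{1,2}),\s+(\d{4})")

def find_month_dates(text):
    """Keep only those matches whose first word is a real month name."""
    return [m.group(0) for m in LOOSE.finditer(text) if m.group(1) in MONTHS]

sample = "Room 12, 2020 was booked on January 1, 2020"
print([m.group(0) for m in LOOSE.finditer(sample)])  # loose pattern: both strings
print(find_month_dates(sample))                      # filtered: only the real date
```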
rprakash