How to separate date values from a text column with special characters in a pandas dataframe?

Question

I have a column with 4 values like below in a dataframe : Have attached the image below for better understanding

Input

India,Chennai - 24 Oct 1992
India,-Chennai, Oct 1992
(Asia) India,Chennai-22 Oct 1992
India,-Chennai, 1992

Output

Place            

India Chennai           
India Chennai            
(Asia) India Chennai     
India Chennai            

Date

24 Oct 1992   
Oct 1992
22 Oct 1992
1992

I need to split the Date and Year(23 Oct 1992, 1992) separately as a column and the text (India,Chennai) as separate column.

I'm bit confused to extract the values, I tried the replace and split options but couldn't achieve the result.

Would appreciate if somebody could help !!

Apologies for the format of Input and Output data !!

What is the rule that tells you where the "place" ends and the "date" begins? — Karl Knechtel, Oct 13 '19 at 06:32
I don't see such rules, I want to split the date values separately thats it. The date values are mostly present after - or , . — Manoj Ravi, Oct 13 '19 at 06:35
Possible duplicate of [How do I read and write CSV files with Python?](https://stackoverflow.com/questions/41585078/how-do-i-read-and-write-csv-files-with-python) — mkrieger1, Oct 13 '19 at 06:57

score 0 · Answer 1 · answered Oct 13 '19 at 06:54

There are lot of ways to create columns by using Pandas library in python, you can create by creating list or by list of dictionaries or by dictionaries of list. for simple understanding here i am going to use lists

first import pandas as pd

import pandas as pd

creating a list from given data

data = [['India','chennai', '24 Oct', 1992], ['India','chennai', '23 Oct', 1992],\
    ['India','chennai', '23 Oct', 1992],['India','chennai', '21 Oct', 1992]]

creating dataframe from list

df = pd.DataFrame(data, columns = ['Country', 'City', 'Date','Year'], index=(0,1,2,3))

print

print(df)

output will be as

  Country     City    Date  Year
0   India  chennai  24 Oct  1992
1   India  chennai  23 Oct  1992
2   India  chennai  23 Oct  1992
3   India  chennai  21 Oct  1992

hope this will help you

Hi Rahul, Thanks for the answer. But the column is already present in a dataframe, and what about the hyphens and commas in the value ? Request you to have a look at the data once again, Think the question is misunderstood. — Manoj Ravi, Oct 13 '19 at 06:58

score 0 · Answer 2 · answered Oct 13 '19 at 07:02

The following assumes that the first digit is where we always want to split the text. If the assumption fails then the code also fails!

>>> import re
>>> text_array                                                                            
['India,Chennai - 24 Oct 1992', 'India,-Chennai,23 Oct 1992', '(Asia) India,Chennai-22 Oct 1992', 'India,-Chennai, 1992']

# split at the first digit, keep the digit, split at only the first digit
>>> tmp = [re.split("([0-9]){1}", t, maxsplit=1) for t in text_array]                     
>>> tmp                                                                                   
[['India,Chennai - ', '2', '4 Oct 1992'], ['India,-Chennai,', '2', '3 Oct 1992'], ['(Asia) India,Chennai-', '2', '2 Oct 1992'], ['India,-Chennai, ', '1', '992']]

# join the last two fields together to get the digit back.

>>> r = [(i[0], "".join(i[1:])) for i in tmp]                                             
>>> r                                                                                     
[('India,Chennai - ', '24 Oct 1992'), ('India,-Chennai,', '23 Oct 1992'), ('(Asia) India,Chennai-', '22 Oct 1992'), ('India,-Chennai, ', '1992')]

If you have control over the how input is generated then I would suggest that the input is made more consistent and then we can parse using a tool like pandas or directly with csv.

Hope this helps.

Regards,

Prasanth

Hi Prasanth, Thanks for the response. This is a column in pandas dataframe. I dont have any control over the input. — Manoj Ravi, Oct 13 '19 at 07:24

moys · Accepted Answer · 2019-10-13T08:11:31.350

0

Use:

import re
df['Date'] = df['col'].str.split("(-|,)").str[-1]
df['Place'] = df.apply(lambda x: x['col'].split(x['Date']), axis=1).str[0].str.replace(',', ' ').str.replace('-', '')

Input

                             col
0   India,Chennai - 24 Oct 1992
1   India,-Chennai,Oct 1992
2   India,-Chennai, 1992
3   (Asia) India,Chennai-22 Oct 1992

Output

                            col            Place                     Date
0   India,Chennai - 24 Oct 1992            India Chennai            24 Oct 1992
1   India,-Chennai,Oct 1992                India Chennai            Oct 1992
2   India,-Chennai, 1992                   India Chennai            1992
3   (Asia) India,Chennai-22 Oct 1992       (Asia) India Chennai     22 Oct 1992

edited Oct 13 '19 at 08:11

answered Oct 13 '19 at 07:07

moys

7,747
2
11
42

Hi Moys, Thanks for the answer. I just now came across another issue, I have updated the second row of input data. I need Oct 1992, but the code is displaying 1992 only. Would be helpful if you could have a look. Thanks Again !! – Manoj Ravi Oct 13 '19 at 07:22
Code updated. Check it out. This will work when the last comma(') or a hyphen(-) is in between date & place values. – moys Oct 13 '19 at 08:12
Thank you so much moys. The code worked fine. :) – Manoj Ravi Oct 13 '19 at 08:33
But still some hypens are not getting replaced as blank spaces in date column. – Manoj Ravi Oct 13 '19 at 09:23
You can add `. str.replace('-', ' ')` to the code which is generating `Date` column. – moys Oct 13 '19 at 10:19
Nope, thats not working. – Manoj Ravi Oct 13 '19 at 14:56

score 0 · Answer 4 · answered Oct 13 '19 at 08:29

Python code:

import re
import pandas as pd 

input_dir = '/content/drive/My Drive/TestData'
csv_file = '{}/test001.csv'.format(input_dir)

p = re.compile(r'(?:[0-9]|[0-2][0-9]|[3][0-1])\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?:\d{4})|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?:\d{4})|(?:\d{4})', re.IGNORECASE)
places = []
dates = []

with open(csv_file, encoding='utf-8', errors='ignore') as f:
  for line in f:
    s = re.sub("[,-]", " ", line.strip())
    s = re.sub("\s+", " ", s)
    r = p.search(s)

    str_date = r.group()
    dates.append(str_date)
    place = s[0:s.find(str_date)]
    places.append(place)

dict = {'Place': places, 
        'Date': dates
       } 

df = pd.DataFrame(dict) 
print(df)

Output:

                   Place         Date
0         India Chennai   24 Oct 1992
1         India Chennai      Oct 1992
2  (Asia) India Chennai   22 Oct 1992
3         India Chennai          1992