-1

I have a column with 4 values like below in a dataframe : Have attached the image below for better understanding

Input

India,Chennai - 24 Oct 1992
India,-Chennai, Oct 1992
(Asia) India,Chennai-22 Oct 1992
India,-Chennai, 1992

Output

Place            

India Chennai           
India Chennai            
(Asia) India Chennai     
India Chennai            

Date

24 Oct 1992   
Oct 1992
22 Oct 1992
1992

I need to split the Date and Year(23 Oct 1992, 1992) separately as a column and the text (India,Chennai) as separate column.

I'm bit confused to extract the values, I tried the replace and split options but couldn't achieve the result.

Would appreciate if somebody could help !!

Apologies for the format of Input and Output data !!

martineau
  • 119,623
  • 25
  • 170
  • 301
Manoj Ravi
  • 103
  • 7

4 Answers4

0

There are lot of ways to create columns by using Pandas library in python, you can create by creating list or by list of dictionaries or by dictionaries of list. for simple understanding here i am going to use lists

first import pandas as pd

import pandas as pd

creating a list from given data

data = [['India','chennai', '24 Oct', 1992], ['India','chennai', '23 Oct', 1992],\
    ['India','chennai', '23 Oct', 1992],['India','chennai', '21 Oct', 1992]]

creating dataframe from list

df = pd.DataFrame(data, columns = ['Country', 'City', 'Date','Year'], index=(0,1,2,3))

print

print(df)

output will be as

  Country     City    Date  Year
0   India  chennai  24 Oct  1992
1   India  chennai  23 Oct  1992
2   India  chennai  23 Oct  1992
3   India  chennai  21 Oct  1992

hope this will help you

  • Hi Rahul, Thanks for the answer. But the column is already present in a dataframe, and what about the hyphens and commas in the value ? Request you to have a look at the data once again, Think the question is misunderstood. – Manoj Ravi Oct 13 '19 at 06:58
  • in which format are you having data – Rahul Sunny Oct 13 '19 at 07:02
0

The following assumes that the first digit is where we always want to split the text. If the assumption fails then the code also fails!

>>> import re
>>> text_array                                                                            
['India,Chennai - 24 Oct 1992', 'India,-Chennai,23 Oct 1992', '(Asia) India,Chennai-22 Oct 1992', 'India,-Chennai, 1992']

# split at the first digit, keep the digit, split at only the first digit
>>> tmp = [re.split("([0-9]){1}", t, maxsplit=1) for t in text_array]                     
>>> tmp                                                                                   
[['India,Chennai - ', '2', '4 Oct 1992'], ['India,-Chennai,', '2', '3 Oct 1992'], ['(Asia) India,Chennai-', '2', '2 Oct 1992'], ['India,-Chennai, ', '1', '992']]

# join the last two fields together to get the digit back.

>>> r = [(i[0], "".join(i[1:])) for i in tmp]                                             
>>> r                                                                                     
[('India,Chennai - ', '24 Oct 1992'), ('India,-Chennai,', '23 Oct 1992'), ('(Asia) India,Chennai-', '22 Oct 1992'), ('India,-Chennai, ', '1992')]

If you have control over the how input is generated then I would suggest that the input is made more consistent and then we can parse using a tool like pandas or directly with csv.

Hope this helps.

Regards,

Prasanth

Prasanth
  • 412
  • 2
  • 4
  • Hi Prasanth, Thanks for the response. This is a column in pandas dataframe. I dont have any control over the input. – Manoj Ravi Oct 13 '19 at 07:24
0

Use:

import re
df['Date'] = df['col'].str.split("(-|,)").str[-1]
df['Place'] = df.apply(lambda x: x['col'].split(x['Date']), axis=1).str[0].str.replace(',', ' ').str.replace('-', '')

Input

                             col
0   India,Chennai - 24 Oct 1992
1   India,-Chennai,Oct 1992
2   India,-Chennai, 1992
3   (Asia) India,Chennai-22 Oct 1992

Output

                            col            Place                     Date
0   India,Chennai - 24 Oct 1992            India Chennai            24 Oct 1992
1   India,-Chennai,Oct 1992                India Chennai            Oct 1992
2   India,-Chennai, 1992                   India Chennai            1992
3   (Asia) India,Chennai-22 Oct 1992       (Asia) India Chennai     22 Oct 1992
moys
  • 7,747
  • 2
  • 11
  • 42
0

Python code:

import re
import pandas as pd 

input_dir = '/content/drive/My Drive/TestData'
csv_file = '{}/test001.csv'.format(input_dir)

p = re.compile(r'(?:[0-9]|[0-2][0-9]|[3][0-1])\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?:\d{4})|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?:\d{4})|(?:\d{4})', re.IGNORECASE)
places = []
dates = []

with open(csv_file, encoding='utf-8', errors='ignore') as f:
  for line in f:
    s = re.sub("[,-]", " ", line.strip())
    s = re.sub("\s+", " ", s)
    r = p.search(s)

    str_date = r.group()
    dates.append(str_date)
    place = s[0:s.find(str_date)]
    places.append(place)

dict = {'Place': places, 
        'Date': dates
       } 

df = pd.DataFrame(dict) 
print(df) 

Output:

                   Place         Date
0         India Chennai   24 Oct 1992
1         India Chennai      Oct 1992
2  (Asia) India Chennai   22 Oct 1992
3         India Chennai          1992
Neda Peyrone
  • 190
  • 1
  • 5