1

I am trying to build a machine learning model using an excel spreadsheet that cannot be edited. The a few of the columns in the .xls have formatting issues so some of the data is displayed as a datetime stamp instead of an str or int. Here is an example from the pd dataframe:

0     40-49   premeno      15-19                  0-2       yes          3   
1     50-59      ge40      15-19                  0-2        no          1   
2     50-59      ge40      35-39                  0-2        no          2   
3     40-49   premeno      35-39                  0-2       yes          3   
4     40-49   premeno      30-34  **2019-05-03 00:00:00**       yes          2

In line 4, the value of 3-5 has been accidentally formatted as a date (shown as 03 May in the xls) and so is assigned as a datetime stamp in the dataframe. I have tried many methods to replace 2019-05-03 00:00:00 with 3-5 including:

df['column'] = df['column'].replace([('2019-05-03 00:00:00')], '3-5') 

and using Timestamp.replace but neither seem to work. Any ideas of how to replace this mis formatted data points with the correct data?

FObersteiner
  • 22,500
  • 8
  • 42
  • 72
4535jacks
  • 31
  • 2
  • a helper against Excel being clever here would be to specify the cell's content like `="3-5"` or change the cell's type to "text" beforehand. – FObersteiner Apr 21 '21 at 13:55

2 Answers2

0

There might be a simpler way but you may need to apply re.search with positive lookarounds.

import re

pat1 = '(?<=\*{2}\d{4}-0\d-0)(\d)(?= 00:00:00\*\*)'

pat2 = '(?<=\*{2}\d{4}-0)(\d)(?=-0\d 00:00:00\*\*)'

df['column'] = df['column'].astype(str).apply(
        lambda x: (re.search(pat2, '**2019-05-03 00:00:00**').group()
                   +'-'+re.search(pat1, '**2019-05-03 00:00:00**').group())
                   if '**' in x else x
     )
semblable
  • 773
  • 1
  • 8
  • 26
  • 1
    did you mean to search x for the patterns? what's with the `**`? I don't get it... Why don't you search for the whole datetime pattern and extract groups if you have a match? Like e.g. with `pat='\d{4}-(\d{2})-(\d{2})\ 00:00:00'`, you can extract `match.group(2).lstrip('0')+'-'+match.group(1).lstrip('0')`, but I don't think doing this in a lambda is nice. – FObersteiner Apr 21 '21 at 06:43
0

You can iterate the column with an apply and check if the element is an instance of pd.Timestamp; if so, extract a string "day-month", otherwise leave as it is.

Ex:

import pandas as pd

# what you have is something like (mixed datatype column/Series)
df = pd.DataFrame({'label': ['0-2', '1-3', pd.Timestamp('2019-05-03')]})

# iterate the column with an apply, extract day-month string if pd.Timestamp
df['label1'] = df['label'].apply(lambda x: f"{x.day}-{x.month}" if isinstance(x, pd.Timestamp) else x)

# ... to get
df['label1'] 
0    0-2
1    1-3
2    3-5
Name: label1, dtype: object
FObersteiner
  • 22,500
  • 8
  • 42
  • 72
  • Do we know those datetime-like strings are currently in datetime format? Since they're reading it in from excel it may be likely that they're just strings unless `read_excel()` is doing that automatically. – semblable Apr 21 '21 at 12:57
  • 1
    @KristianCanler if an original entry like "3-5" in Excel later appears as "2019-05-03 00:00:00" in a pandas Series, it is *likely* that Excel (clever as always...) converted to Excel's internal date representation (days since 1900), which is then converted to pandas datetime upon import. – FObersteiner Apr 21 '21 at 13:33