1

I have two dataframes:

OrderedDict([('page1',     name       dob
          0   John  07-20200
          1  Lilly   05-1999
          2  James   02-2002), ('page2',      name       dob
          0   Chris   07-2020
          1  Robert   05-1999
          2    barb  02-20022)])

I want to run my regular expression against each date in both dataframes. If they all match, I want to continue with my program. If any date does not match, I want to print a message that shows the dataframe name, the index, and the date that's wrong, like this:

INVALID DATE: Page1: index 0: dob: 07-20200
INVALID DATE: Page2: index 2: dob: 02-20022

I got to this point:

    date_pattern = r'(?<!\d)((?:0?[1-9]|1[0-2])-(?:19|20)\d{2})(?!\d)'
    for df_name, df in employee_dict.items():
        # keep the rows whose dob contains a match for the pattern
        x = df[df.dob.str.contains(date_pattern, regex=True)]
        print(x)

That prints the rows that do match, in table format, but I want to print the rows that don't match, each in its own print statement.

Any ideas?

JTHDR
  • Do you mean you need `for df_name, df in employee_dict.items(): for index, row in df.iterrows(): if not re.search(date_pattern, row['dob']): print("INVALID DATE: {}: index {}: dob: {}".format(df_name, index,row['dob'])) `? (add newlines and indentation that is lost in the comment). – Wiktor Stribiżew May 04 '20 at 09:20
  • THANK YOU! This is exactly what I was trying to do! Do you mind explaining the logic behind this? I'm trying to learn as much as possible so I can improve. – JTHDR May 04 '20 at 12:37
  • I added an [answer](https://stackoverflow.com/a/61592409/3832970). – Wiktor Stribiżew May 04 '20 at 12:41

2 Answers

1

You may iterate over all the rows of the dataframes and, if an entry does not match your pattern, generate the message of your choice:

import re

for df_name, df in employee_dict.items():            # Iterate over your DFs
    for index, row in df.iterrows():                 # Iterate over DF rows
        if not re.search(date_pattern, row['dob']):  # If the dob column value has no match
            print("INVALID DATE: {}: index {}: dob: {}".format(df_name, index, row['dob']))  # Print error message

If employee_dict holds a single dataframe under the key page1, defined as pd.DataFrame({'dob': ['05-2020','4-2020','07-1999','2-2001','1-20202020','112-2020']}), the results will be:

INVALID DATE: page1: index 4: dob: 1-20202020
INVALID DATE: page1: index 5: dob: 112-2020
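
As a self-contained sketch of the same loop, the snippet below rebuilds the two dataframes shown in the question and collects the rows that fail the check. The helper name `find_invalid_dates` is hypothetical (it does not appear in the question or answer), and it assumes every `dob` value is a string.

```python
import re
from collections import OrderedDict

import pandas as pd

date_pattern = r'(?<!\d)((?:0?[1-9]|1[0-2])-(?:19|20)\d{2})(?!\d)'

# Sample data copied from the question.
employee_dict = OrderedDict([
    ('page1', pd.DataFrame({'name': ['John', 'Lilly', 'James'],
                            'dob': ['07-20200', '05-1999', '02-2002']})),
    ('page2', pd.DataFrame({'name': ['Chris', 'Robert', 'barb'],
                            'dob': ['07-2020', '05-1999', '02-20022']})),
])


def find_invalid_dates(frames, pattern):
    """Hypothetical helper: return (df_name, index, dob) for each dob with no match."""
    invalid = []
    for df_name, df in frames.items():
        for index, row in df.iterrows():
            if not re.search(pattern, row['dob']):
                invalid.append((df_name, index, row['dob']))
    return invalid


invalid_dates = find_invalid_dates(employee_dict, date_pattern)
for df_name, index, dob in invalid_dates:
    print("INVALID DATE: {}: index {}: dob: {}".format(df_name, index, dob))
```

With this data, the two rows reported are index 0 of page1 (07-20200) and index 2 of page2 (02-20022). The lookarounds `(?<!\d)` and `(?!\d)` are what reject those values: the year part may not be followed by a further digit.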
Wiktor Stribiżew
  • Perfect explanation! If I store the message in a variable, like `date_alert = str("DATE ALERT: {}: index {}: dob: {}".format(df_name, index, row['dob']))`, do you know how to still print both messages? In this case it prints only one, from the last iteration. – JTHDR May 04 '20 at 13:12
  • @JTHDR Use a list, `date_alerts =[]`, then add the `date_alert` once found to the list, `date_alerts.append(date_alert)`, then show them (say, `print("\n".join(date_alerts))`) – Wiktor Stribiżew May 04 '20 at 13:14
  • This works too! Can you please explain this as well, and why the logic works like this? I'm learning a lot! – JTHDR May 04 '20 at 13:19
  • @JTHDR If you have multiple items to store, use a list. To show a list, you need to concatenate the items, or show them one by one (`for message in date_alerts: print(message)` ) – Wiktor Stribiżew May 04 '20 at 13:21
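
The list approach from the comments above could look roughly like this. To keep the sketch small it uses plain lists of strings instead of dataframes; `dob_by_page` is a hypothetical stand-in for `employee_dict`, and the final `if`/`else` is one possible way to decide whether to continue.

```python
import re

date_pattern = r'(?<!\d)((?:0?[1-9]|1[0-2])-(?:19|20)\d{2})(?!\d)'

# Hypothetical stand-in for the dataframes: page name -> list of dob strings.
dob_by_page = {
    'page1': ['07-20200', '05-1999', '02-2002'],
    'page2': ['07-2020', '05-1999', '02-20022'],
}

date_alerts = []  # created once, before the loops, so earlier alerts are kept
for page_name, dobs in dob_by_page.items():
    for index, dob in enumerate(dobs):
        if not re.search(date_pattern, dob):
            date_alerts.append(
                "DATE ALERT: {}: index {}: dob: {}".format(page_name, index, dob))

if date_alerts:
    print("\n".join(date_alerts))   # at least one bad date: show all of them
else:
    print("All dates are valid")    # nothing collected: safe to continue
```

A single `date_alert` variable is reassigned on every iteration, so only the last value survives the loop. Appending to a list keeps each message.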
0

You're looking for Series.str.match.

Essentially, you need to extract the dob series, which I assume is what you're doing with df['dob'], and do result = df['dob'].str.match(date_pattern). The result will be a series of True and False values, corresponding to their respective df['dob'] values.
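
A minimal sketch of that idea, assuming the page2 data from the question and string-only `dob` values: negate the boolean result with `~` to select the rows that did not match, then print one line per row. The name `invalid` is illustrative.

```python
import pandas as pd

date_pattern = r'(?<!\d)((?:0?[1-9]|1[0-2])-(?:19|20)\d{2})(?!\d)'

# page2 from the question.
df = pd.DataFrame({'name': ['Chris', 'Robert', 'barb'],
                   'dob': ['07-2020', '05-1999', '02-20022']})

result = df['dob'].str.match(date_pattern)  # True where the pattern matches
invalid = df.loc[~result, 'dob']            # keep only the non-matching rows

for index, dob in invalid.items():
    print("INVALID DATE: {}: index {}: dob: {}".format('page2', index, dob))
```

Note that `str.match` anchors the pattern at the start of each string, whereas `str.contains` and `re.search` look for a match anywhere in it, so the two can disagree on values with leading text.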

Chase
  • I have that part, but I'm trying to add a condition: if all are true, then continue; if not, print an 'invalid dob' message showing where. – JTHDR May 03 '20 at 17:39
  • @JTHDR I recommend reading the Pandas documentation on selecting data: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html – shadowtalker May 03 '20 at 17:57
  • @JTHDR you could simply just check the returned `result` for having `False` values. That's all you have to do – Chase May 03 '20 at 17:59
  • I keep running into the "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." error – JTHDR May 04 '20 at 00:48
  • @JTHDR Read https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o – Chase May 04 '20 at 05:37
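
The error quoted in the comments above appears when a whole Series is used where a single True/False is expected, for example in `if result:`. One way around it, sketched here with the `dob` columns from the question, is to reduce each Series with `.all()` first; `raised` and `all_valid` are illustrative names.

```python
import pandas as pd

date_pattern = r'(?<!\d)((?:0?[1-9]|1[0-2])-(?:19|20)\d{2})(?!\d)'

# dob columns copied from the question.
employee_dict = {
    'page1': pd.DataFrame({'dob': ['07-20200', '05-1999', '02-2002']}),
    'page2': pd.DataFrame({'dob': ['07-2020', '05-1999', '02-20022']}),
}

result = employee_dict['page1']['dob'].str.match(date_pattern)

try:
    if result:           # a multi-element Series has no single truth value
        pass
    raised = False
except ValueError:
    raised = True        # "The truth value of a Series is ambiguous..."

# Reduce each Series to one boolean with .all(), then combine across frames.
all_valid = all(df['dob'].str.match(date_pattern).all()
                for df in employee_dict.values())

if all_valid:
    print("All dates valid, continuing")
else:
    print("At least one invalid dob found")
```

`.any()` works the same way when the question is whether at least one value matches.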