0

I have a dataframe which has around 110 columns and around 2 million rows. I want to find the count of unique date count in each row from a column called comments. The 'Comments' column look something like below

------------------------------------------------------------------------
ID       Comments
------------------------------------------------------------------------
1        Log Type: customer chat
         chat history:
            xxxxxxxxx
            xxxxxxx
            xxxxxxxxxxxxxxx
            May 10 2020 23:34:57 +GMT 05:30
            --------------------------------------------
            log type: Phone call
            issue type: xxxxxx
            issue:
             qqqqqqqqqqqq
             qqqqqqqqqqqqqqqqqqqqqqq
             qqqqqqqqqqqqqqq
             May 11 2020 08:54:54 + GMT 05:30
             ----------------------------------------------
             log type: phone call
             issue:
              eeeeeeeeeeeeee
              eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
              eeeeeeee
              eeeeeeeeeee
              eeeeeeeeeeee
              eeeeeeeeeeeeeeeeeee
              May 11 2020 14:58:54 + GMT 05:30
            ----------------------------------
----------------------------------------------------------------------------
2           Log Type: Phone call
            issue:
            xxxxxxxxx
            xxxxxxx
            xxxxxxxxxxxxxxx
            May 10 2020 23:34:57 +GMT 05:30
            --------------------------------------------
            log type: Phone call
            issue type: xxxxxx
            issue:
             qqqqqqqqqqqq
             qqqqqqqqqqqqqqqqqqqqqqq
             qqqqqqqqqqqqqqq
             May 11 2020 08:54:54 + GMT 05:30
             ----------------------------------------------
             log type: phone call
             issue:
               eeeeeeeeeeeeee
               eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
               eeeeeeee
               eeeeeeeeeee
               eeeeeeeeeeee
               eeeeeeeeeeeeeeeeeee
             5/12/2020 14:58:54 + GMT 05:30
            ----------------------------------------------

The desired output is as given below

ID Count
1   2
2   3

can anyone help on this?

  • 1
    have you tried cleaning up the column by identifying the different date formats that would be found in the string? do you know how to split the lines into a list of strings? please include more detail on what you've tried so the answers can guide you from there – RichieV Oct 27 '20 at 05:15
  • I have tried using search_dates without lambda it is running but takes lot of time to run is there a better way? since it is running for more than 3 days now – Krishnamurthy Narayanaswamy Oct 27 '20 at 05:20
  • is the date ALWAYS the last line before a line of hyphens? if so, then research how to do a regex lookahead as in [this answer](https://stackoverflow.com/a/47887112/6692898)... which then you can use with `series.str.extract_all`, that should speed things up – RichieV Oct 27 '20 at 14:11

2 Answers2

1

try this

import re

def count(x):
    comments = x['Comments']
    date_list = re.findall(r"[A-Za-z]{3}\s\d+\s\d{4}", comments)
    count = len(set(date_list))
    return count

df['count'] = df.apply(count, axis=1)
print(df[['Comments', 'count']])
1

Edited the answer based on comments:

1.Getting all the dates first. Note that the regex in str.findall includes patterns to match the formats of "MAY 20 2020" or "5/12/2020" or "05/12/2020"

s = df['Comments'].str.findall(r'[\w\s\.]*(\w{3}\s\d{2}\s\d{4}|\d?\d/\d?\d/\d{4})[\w\s\.]*')
print(s)
0    [May 10 2020, May 11 2020, May 11 2020]
1      [May 10 2020, May 11 2020, 5/12/2020]

2.Above returns a list. Now, we have to standardize the date format to one standard format.

def conv(x):
    for val in x:
        if re.match("\d?\d/\d?\d/\d{4}",val) != None:
            x.remove(val)
            val = datetime.datetime.strptime(val, '%m/%d/%Y').strftime('%b %d %Y')
            x.append(val)
    return x
s.apply(lambda x: conv(x))
0    [May 10 2020, May 11 2020, May 11 2020]
1    [May 10 2020, May 11 2020, May 12 2020]

Now, we can extract the unique counts from the series and then add the column "Count" in the original df.

df['count'] = s.transform(set).str.len()
print(df)
   ID                                           Comments  count
0   1  Log Type: customer chat chat history: xxxxxxxx...      2
1   2  Log Type: Phone call issue: xxxxxxxxx xxxxxxx ...      3
sharathnatraj
  • 1,614
  • 5
  • 14