Get the count of Unique time stamps from a text column in a dataframe in python

Question

I have a dataframe with around 110 columns and around 2 million rows. I want to count the unique dates in each row of a column called 'Comments'. The 'Comments' column looks something like this:

------------------------------------------------------------------------
ID       Comments
------------------------------------------------------------------------
1        Log Type: customer chat
         chat history:
            xxxxxxxxx
            xxxxxxx
            xxxxxxxxxxxxxxx
            May 10 2020 23:34:57 +GMT 05:30
            --------------------------------------------
            log type: Phone call
            issue type: xxxxxx
            issue:
             qqqqqqqqqqqq
             qqqqqqqqqqqqqqqqqqqqqqq
             qqqqqqqqqqqqqqq
             May 11 2020 08:54:54 + GMT 05:30
             ----------------------------------------------
             log type: phone call
             issue:
              eeeeeeeeeeeeee
              eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
              eeeeeeee
              eeeeeeeeeee
              eeeeeeeeeeee
              eeeeeeeeeeeeeeeeeee
              May 11 2020 14:58:54 + GMT 05:30
            ----------------------------------
----------------------------------------------------------------------------
2           Log Type: Phone call
            issue:
            xxxxxxxxx
            xxxxxxx
            xxxxxxxxxxxxxxx
            May 10 2020 23:34:57 +GMT 05:30
            --------------------------------------------
            log type: Phone call
            issue type: xxxxxx
            issue:
             qqqqqqqqqqqq
             qqqqqqqqqqqqqqqqqqqqqqq
             qqqqqqqqqqqqqqq
             May 11 2020 08:54:54 + GMT 05:30
             ----------------------------------------------
             log type: phone call
             issue:
               eeeeeeeeeeeeee
               eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
               eeeeeeee
               eeeeeeeeeee
               eeeeeeeeeeee
               eeeeeeeeeeeeeeeeeee
             5/12/2020 14:58:54 + GMT 05:30
            ----------------------------------------------

The desired output is as given below

ID Count
1   2
2   3

Can anyone help with this?

Have you tried cleaning up the column by identifying the different date formats that appear in the string? Do you know how to split the lines into a list of strings? Please include more detail on what you've tried so the answers can guide you from there – RichieV, Oct 27 '20 at 05:15
I have tried using search_dates without a lambda. It runs, but it takes a lot of time; it has been running for more than 3 days now. Is there a better way? – Krishnamurthy Narayanaswamy, Oct 27 '20 at 05:20
Is the date ALWAYS the last line before a line of hyphens? If so, research how to do a regex lookahead as in [this answer](https://stackoverflow.com/a/47887112/6692898), which you can then use with `Series.str.extractall`; that should speed things up – RichieV, Oct 27 '20 at 14:11

score 1 · Answer 1 · answered Oct 27 '20 at 05:34

Try this:

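import re

def count(x):
    # x is one row of the dataframe; pull out the free-text comments
    comments = x['Comments']
    # Match dates such as "May 10 2020" (3-letter month, day, 4-digit year)
    date_list = re.findall(r"[A-Za-z]{3}\s\d+\s\d{4}", comments)
    # A set drops repeated dates, so its length is the unique-date count
    return len(set(date_list))

df['count'] = df.apply(count, axis=1)
print(df[['Comments', 'count']])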
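
The same counting can probably be done on the 'Comments' column alone, without passing every 110-column row through `df.apply(..., axis=1)`. The following is a minimal sketch, not part of the original answer: the two-row dataframe is made-up sample data, and `fillna("")` assumes that missing comments should count as zero dates.

```python
import pandas as pd

# Made-up stand-in for the real dataframe (hypothetical sample data).
df = pd.DataFrame({
    "ID": [1, 2],
    "Comments": [
        "chat\nMay 10 2020 23:34:57\n----\ncall\nMay 11 2020 08:54:54\n----\ncall\nMay 11 2020 14:58:54",
        "call\nMay 10 2020 23:34:57\n----\ncall\nMay 12 2020 08:54:54",
    ],
})

# Same pattern as in the answer, applied to the text column only.
pattern = r"[A-Za-z]{3}\s\d+\s\d{4}"

# str.findall returns one list of matches per row; empty text gives an empty list.
dates = df["Comments"].fillna("").str.findall(pattern)

# Count the distinct matches in each row's list.
df["count"] = dates.map(lambda found: len(set(found)))
print(df[["ID", "count"]])
```

Like the answer's regex, this only recognises the "May 10 2020" style, so numeric dates such as 5/12/2020 are not counted.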

Raghav Sharma


Thanks a lot, Raghav, for the answer. Could you also help me run this on 2 million records using vectorization and parallelization techniques, please? – Krishnamurthy Narayanaswamy Oct 28 '20 at 19:19
You can use modin. See https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html – Raghav Sharma Oct 29 '20 at 07:09

sharathnatraj · Accepted Answer · 2020-10-29T09:49:38.147

Edited the answer based on comments:

1. Get all the dates first. Note that the regex passed to str.findall includes patterns that match the formats "May 20 2020", "5/12/2020", and "05/12/2020".

s = df['Comments'].str.findall(r'[\w\s\.]*(\w{3}\s\d{2}\s\d{4}|\d?\d/\d?\d/\d{4})[\w\s\.]*')
print(s)
0    [May 10 2020, May 11 2020, May 11 2020]
1      [May 10 2020, May 11 2020, 5/12/2020]

2. The call above returns a list of dates for each row. Now we have to standardize the dates to one format.

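import re
import datetime

def conv(x):
    # Build a new list: removing from x while iterating over it can skip items
    out = []
    for val in x:
        if re.match(r"\d?\d/\d?\d/\d{4}", val) is not None:
            # Numeric dates are read as month/day/year: 5/12/2020 -> May 12 2020
            val = datetime.datetime.strptime(val, '%m/%d/%Y').strftime('%b %d %Y')
        out.append(val)
    return out

s = s.apply(conv)
print(s)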
0    [May 10 2020, May 11 2020, May 11 2020]
1    [May 10 2020, May 11 2020, May 12 2020]

3. Now we can count the unique dates in each list and add the result as the column "count" in the original df.

df['count'] = s.transform(set).str.len()
print(df)
   ID                                           Comments  count
0   1  Log Type: customer chat chat history: xxxxxxxx...      2
1   2  Log Type: Phone call issue: xxxxxxxxx xxxxxxx ...      3
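
The three steps above can also be folded into one function that parses every match into a real date before counting, so that "5/12/2020" and "May 12 2020" collapse to the same value. This is a minimal sketch under stated assumptions, not the answer's own code: the names `FORMATS`, `DATE_RE`, `parse_date` and `unique_date_count` are made up for illustration, numeric dates are assumed to be month/day/year, the "4 May 2020" style is taken from the comments below, and `%b` assumes English month abbreviations.

```python
import re
from datetime import datetime

import pandas as pd

# Candidate formats, tried in order. Reading numeric dates as month/day/year
# is an assumption; use "%d/%m/%Y" instead if the data is day-first.
FORMATS = ("%b %d %Y", "%d %b %Y", "%m/%d/%Y")

DATE_RE = re.compile(
    r"\b(?:[A-Za-z]{3}\s\d{1,2}\s\d{4}"   # May 10 2020
    r"|\d{1,2}\s[A-Za-z]{3}\s\d{4}"       # 4 May 2020
    r"|\d{1,2}/\d{1,2}/\d{4})\b"          # 5/12/2020 or 05/12/2020
)

def parse_date(text):
    """Return a date for the first format that fits, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def unique_date_count(comment):
    """Count distinct calendar dates found in one comment string."""
    parsed = {parse_date(match) for match in DATE_RE.findall(comment)}
    parsed.discard(None)  # drop matches that are not real dates, e.g. "xxx 12 2020"
    return len(parsed)

# Made-up sample data (hypothetical, shortened from the question's layout).
df = pd.DataFrame({"ID": [1, 2], "Comments": [
    "May 10 2020 23:34:57\nMay 11 2020 08:54:54\nMay 11 2020 14:58:54",
    "May 10 2020 23:34:57\n4 May 2020 08:54:54\n5/12/2020 14:58:54\n05/12/2020 09:00:00",
]})
df["count"] = df["Comments"].map(unique_date_count)
print(df[["ID", "count"]])
```

Comparing parsed dates rather than strings is a design choice: it avoids a separate normalisation pass, at the cost of one `strptime` attempt per format for every match.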

edited Oct 29 '20 at 09:49

answered Oct 27 '20 at 06:14

sharathnatraj


The dates are not necessarily all in the same format; that is the catch – Krishnamurthy Narayanaswamy Oct 28 '20 at 19:22
Isn't Raghav's answer also assuming the date to be in a standard format? – sharathnatraj Oct 29 '20 at 02:16
How many different date formats are there? We can write a generalized regex if you know all the different formats. – sharathnatraj Oct 29 '20 at 02:17
It can be either 4 May 2020 or 5/4/2020 or 04/05/2020. But the issue is how to run it on 2 million records – Krishnamurthy Narayanaswamy Oct 29 '20 at 07:26
I have edited the answer to include the patterns for different date types. I was trying to avoid any loops but had to use "apply" because of the different date formats. Run this on your data and check. – sharathnatraj Oct 29 '20 at 09:51
Thank you, Sharath, it did work. Another query: is it possible to count unique dates only for the entries that have "Phone Log"? If so, can you please help me with that? That is, where text = "Phone Log", count the unique dates and give the count as output – Krishnamurthy Narayanaswamy Nov 02 '20 at 11:18
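
The last follow-up was not answered in the thread. One possible approach is to split each comment into log entries at the lines of hyphens, keep only the entries whose log type is a phone call, and count the dates found in those. This is a rough sketch under assumptions: the label is taken to be "log type: phone call" as in the question's sample (the comment says "Phone Log", so adjust `LOG_TYPE_RE` to the real text), a separator is assumed to be a line of five or more hyphens, and all names here are made up for illustration.

```python
import re

import pandas as pd

# "May 10 2020" or "5/12/2020" style dates.
DATE_RE = re.compile(r"[A-Za-z]{3}\s\d{1,2}\s\d{4}|\d{1,2}/\d{1,2}/\d{4}")
# A separator is a line made only of hyphens (assumed: at least five).
SEPARATOR_RE = re.compile(r"^[ \t]*-{5,}[ \t]*$", flags=re.MULTILINE)
# Assumed label for phone entries; matched case-insensitively.
LOG_TYPE_RE = re.compile(r"log type:\s*phone call", flags=re.IGNORECASE)

def phone_date_count(comment):
    """Count distinct date strings inside phone-call entries only."""
    dates = set()
    for entry in SEPARATOR_RE.split(comment):
        if LOG_TYPE_RE.search(entry):
            dates.update(DATE_RE.findall(entry))
    return len(dates)

# Made-up sample data shaped like the question's 'Comments' column.
df = pd.DataFrame({"ID": [1, 2], "Comments": [
    "Log Type: customer chat\nchat history:\nxxx\nMay 10 2020 23:34:57 +GMT 05:30\n"
    "--------------------\nlog type: Phone call\nissue:\nqqq\nMay 11 2020 08:54:54 + GMT 05:30\n"
    "--------------------\nlog type: phone call\nissue:\neee\nMay 11 2020 14:58:54 + GMT 05:30\n"
    "--------------------",
    "Log Type: Phone call\nissue:\nxxx\nMay 10 2020 23:34:57 +GMT 05:30\n"
    "--------------------\nlog type: Phone call\nissue:\nqqq\nMay 11 2020 08:54:54 + GMT 05:30\n"
    "--------------------\nlog type: phone call\nissue:\neee\n5/12/2020 14:58:54 + GMT 05:30\n"
    "--------------------",
]})
df["phone_count"] = df["Comments"].map(phone_date_count)
print(df[["ID", "phone_count"]])
```

This sketch compares the matched strings as they appear, so the same day written in two formats would be counted twice; standardizing the matches first, as `conv` does in the accepted answer, would avoid that.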
