1

We are analyzing the streaming twitter data to find users who are posting similar (almost same) tweets over and over. I am using MATCH_RECOGNIZE for this. It is able to find the pattern, but I am not able to get the FIRST() and the LAST() values correctly. Here is sample dataset:

enter image description here

I am using the following Query:

SELECT 
  USERID
  , NUM_OF_TWEETS
  , FIRST_TWEET
  , LAST_TWEET
  , FIRST_TWEET_ID
  , LAST_TWEET_ID
FROM SCRATCH.SAQIB_ALI.TWEETS
MATCH_RECOGNIZE(
  PARTITION BY USERID
  ORDER BY TWEETID ASC
  MEASURES
    FIRST(TWEET) AS FIRST_TWEET,
    LAST(TWEET) AS LAST_TWEET,
    FIRST(TWEETID) AS FIRST_TWEET_ID,
    LAST(TWEETID) AS LAST_TWEET_ID,
    COUNT(*) AS NUM_OF_TWEETS
    
  ONE ROW PER MATCH
  PATTERN (SIMILAR+)
  DEFINE
    SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
    
);

This correct identify the users that are posting same tweets over an over: enter image description here

But I am not able to get the first tweet and the last tweet in the matching sequence.

Saqib Ali
  • 3,953
  • 10
  • 55
  • 100

2 Answers2

3

There are multiple things at play.

The first is you only "have one row trigging a match" so first and last are the second row of you data. This can be seen by changing to ALL ROWS PER MATCH

with tweets(userid, tweetid, tweet) as (
    select * from values
    ('elena', 1, 'aaa'),
    ('elena', 2, 'aaaa')
)
SELECT 
*
FROM TWEETS
MATCH_RECOGNIZE(
  PARTITION BY USERID
  ORDER BY TWEETID ASC
  MEASURES
    match_number() as match_number,
    FIRST(TWEET) AS FIRST_TWEET,
    LAST(TWEET) AS LAST_TWEET,
    FIRST(TWEETID) AS FIRST_TWEET_ID,
    LAST(TWEETID) AS LAST_TWEET_ID,
    COUNT(*) AS NUM_OF_TWEETS
  ALL ROWS PER MATCH
  PATTERN (SIMILAR+)
  DEFINE
    SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
USERID TWEETID TWEET MATCH_NUMBER FIRST_TWEET LAST_TWEET FIRST_TWEET_ID LAST_TWEET_ID NUM_OF_TWEETS
elena 2 aaaa 1 aaaa aaaa 2 2 1

if you change to say a match that catches the first value and the lag values:

  ALL ROWS PER MATCH
  PATTERN (SIMILAR_before SIMILAR_after+)
  DEFINE
    SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
    SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90

you now match both the first and latter rows..

USERID TWEETID TWEET MATCH_NUMBER FIRST_TWEET LAST_TWEET FIRST_TWEET_ID LAST_TWEET_ID NUM_OF_TWEETS
elena 1 aaa 1 aaa aaa 1 1 1
elena 2 aaaa 1 aaa aaaa 1 2 2

now if we expand our test a little bit more with four rows of data:

with tweets(userid, tweetid, tweet) as (
    select * from values
    ('elena', 1, 'aaa'),
    ('elena', 2, 'aaaa'),
    ('elena', 3, 'aaa'),
    ('elena', 4, 'aaaa')
)
USERID TWEETID TWEET MATCH_NUMBER FIRST_TWEET LAST_TWEET FIRST_TWEET_ID LAST_TWEET_ID NUM_OF_TWEETS
elena 1 aaa 1 aaa aaa 1 1 1
elena 2 aaaa 1 aaa aaaa 1 2 2
elena 3 aaa 1 aaa aaa 1 3 3
elena 4 aaaa 1 aaa aaaa 1 4 4

we see those values are not double registering..

BUT we also see the first ID is correct for all rows, but the last is within the scope of the current matched row, so not after all matches as you are hoping.

If we flip back to one row per match we do how ever get the results we are expecting.

with tweets(userid, tweetid, tweet) as (
    select * from values
    ('elena', 1, 'aaa'),
    ('elena', 2, 'aaaa'),
    ('scott', 3, 'aaaa'),
    ('eva', 4, 'bbbb'),
    ('eva', 5, 'bbbbb'),
    ('amy', 4, 'eeee'),
    ('amy', 5, 'zzzz')
)
SELECT 
 USERID
  , NUM_OF_TWEETS
  , FIRST_TWEET
  , LAST_TWEET
  , FIRST_TWEET_ID
  , LAST_TWEET_ID
FROM TWEETS
MATCH_RECOGNIZE(
  PARTITION BY USERID
  ORDER BY TWEETID ASC
  MEASURES
    match_number() as match_number,
    FIRST(TWEET) AS FIRST_TWEET,
    LAST(TWEET) AS LAST_TWEET,
    FIRST(TWEETID) AS FIRST_TWEET_ID,
    LAST(TWEETID) AS LAST_TWEET_ID,
    COUNT(*) AS NUM_OF_TWEETS
  ONE ROW PER MATCH
  PATTERN (SIMILAR_before SIMILAR_after+)
  DEFINE
    SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
    SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
USERID NUM_OF_TWEETS FIRST_TWEET LAST_TWEET FIRST_TWEET_ID LAST_TWEET_ID
elena 2 aaa aaaa 1 2
eva 2 bbbb bbbbb 4 5
Simeon Pilgrim
  • 22,906
  • 3
  • 32
  • 45
  • Very nice. I wasn't aware you could apply multiple defined patterns in the same MATCH_RECOGNIZE function. Interested in knowing whether having to apply multiple patterns would scale well over large volumes of data. – Jim Demitriou Jul 31 '22 at 05:34
  • 2
    The quality of this answer is awesome. Match Recognize is non-trivial this answer clearly articulates the answer along with explanation. 5 stars! – Adrian White Jul 31 '22 at 06:13
0

Naturally, I was working on this while Simeon was attacking the same problem. I ran into similar issues and noted the logic applied to the window frame, and therefore you needed to account for certain functions only working from the matched row, and then you would miss the first, et al.

I did an old-school approach, nesting views to incrementally address the problem.

Both solve the problem at hand - and while I like the use of MATCH_RECOGNIZE in the provided answer (it's more elegant as a single query), it may be difficult for others to understand.

--
-- create test table
--

create
or replace table tweets (
    userid varchar,
    tweetid integer,
    tweet varchar
);
--
-- create test data
--
insert into
    tweets (userid, tweetid, tweet)
values
    ('elena', 1, 'aaa'),
    ('elena', 2, 'aaaa'),
    ('scott', 3, 'aaaa'),
    ('eva', 4, 'bbbb'),
    ('eva', 5, 'bbbbb'),
    ('amy', 4, 'eeee'),
    ('amy', 5, 'zzzz');
--
-- Baseline view showing matching tweets by user
--
    CREATE
    OR REPLACE VIEW MATCHES AS (
        SELECT
            T1.USERID,
            T1.TWEETID AS TWEETID,
            T2.TWEETID AS MATCHING_TWEETID
        FROM
            TWEETS T1,
            TWEETS T2
        WHERE
            T1.USERID = T2.USERID
            AND JAROWINKLER_SIMILARITY(T1.TWEET, T2.TWEET) > 90
    );


--
-- create a view of non-repeating tweets
--
create or replace view single_tweets as (
        select
            userid,
            tweetid,
            count(*) as num_tweets
        from
            matches
        group by
            userid,
            tweetid
        having
            count(*) = 1
    );
    
    select * from single_tweets;
--
-- Create a view of only repeating tweets by tweetid
--
create
    or replace view repeating_tweets as (
        select
            userid,
            tweetid,
            matching_tweetid
        from
            matches
        where
            (userid, tweetid) not in (
                select
                    userid, tweetid
                from
                    single_tweets
            )
            and (userid,tweetid) not in (
                select
                    userid, tweetid
                from
                    matches
                where
                    matching_tweetid < tweetid
            )
        order by
            tweetid,
            matching_tweetid
    );


--
-- only report repeating tweets
--
select
    t.userid,
    min(t.tweet) as FIRST_TWEET,
    max(t.tweet) as LAST_TWEET,
    min(t.tweetid) as FIRST_TWEETID,
    max(t.tweetid) as LAST_TWEETID,
    count(rt.matching_tweetid) as num_tweets
from
    tweets t,
    repeating_tweets rt
where
    t.userid = rt.userid
    and t.tweetid = rt.matching_tweetid
group by
    t.userid,
    rt.tweetid;

Results:

USERID  FIRST_TWEET LAST_TWEET  FIRST_TWEETID   LAST_TWEETID    NUM_TWEETS
eva     bbbb        bbbbb       4               5               2
elena   aaa         aaaa        1               2               2
Jim Demitriou
  • 593
  • 4
  • 8