There are multiple things at play.
The first is you only "have one row trigging a match" so first and last are the second row of you data. This can be seen by changing to ALL ROWS PER MATCH
with tweets(userid, tweetid, tweet) as (
select * from values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa')
)
SELECT
*
FROM TWEETS
MATCH_RECOGNIZE(
PARTITION BY USERID
ORDER BY TWEETID ASC
MEASURES
match_number() as match_number,
FIRST(TWEET) AS FIRST_TWEET,
LAST(TWEET) AS LAST_TWEET,
FIRST(TWEETID) AS FIRST_TWEET_ID,
LAST(TWEETID) AS LAST_TWEET_ID,
COUNT(*) AS NUM_OF_TWEETS
ALL ROWS PER MATCH
PATTERN (SIMILAR+)
DEFINE
SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
USERID |
TWEETID |
TWEET |
MATCH_NUMBER |
FIRST_TWEET |
LAST_TWEET |
FIRST_TWEET_ID |
LAST_TWEET_ID |
NUM_OF_TWEETS |
elena |
2 |
aaaa |
1 |
aaaa |
aaaa |
2 |
2 |
1 |
if you change to say a match that catches the first value and the lag values:
ALL ROWS PER MATCH
PATTERN (SIMILAR_before SIMILAR_after+)
DEFINE
SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
you now match both the first and latter rows..
USERID |
TWEETID |
TWEET |
MATCH_NUMBER |
FIRST_TWEET |
LAST_TWEET |
FIRST_TWEET_ID |
LAST_TWEET_ID |
NUM_OF_TWEETS |
elena |
1 |
aaa |
1 |
aaa |
aaa |
1 |
1 |
1 |
elena |
2 |
aaaa |
1 |
aaa |
aaaa |
1 |
2 |
2 |
now if we expand our test a little bit more with four rows of data:
with tweets(userid, tweetid, tweet) as (
select * from values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa'),
('elena', 3, 'aaa'),
('elena', 4, 'aaaa')
)
USERID |
TWEETID |
TWEET |
MATCH_NUMBER |
FIRST_TWEET |
LAST_TWEET |
FIRST_TWEET_ID |
LAST_TWEET_ID |
NUM_OF_TWEETS |
elena |
1 |
aaa |
1 |
aaa |
aaa |
1 |
1 |
1 |
elena |
2 |
aaaa |
1 |
aaa |
aaaa |
1 |
2 |
2 |
elena |
3 |
aaa |
1 |
aaa |
aaa |
1 |
3 |
3 |
elena |
4 |
aaaa |
1 |
aaa |
aaaa |
1 |
4 |
4 |
we see those values are not double registering..
BUT we also see the first ID is correct for all rows, but the last is within the scope of the current matched row, so not after all matches as you are hoping.
If we flip back to one row per match
we do how ever get the results we are expecting.
with tweets(userid, tweetid, tweet) as (
select * from values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa'),
('scott', 3, 'aaaa'),
('eva', 4, 'bbbb'),
('eva', 5, 'bbbbb'),
('amy', 4, 'eeee'),
('amy', 5, 'zzzz')
)
SELECT
USERID
, NUM_OF_TWEETS
, FIRST_TWEET
, LAST_TWEET
, FIRST_TWEET_ID
, LAST_TWEET_ID
FROM TWEETS
MATCH_RECOGNIZE(
PARTITION BY USERID
ORDER BY TWEETID ASC
MEASURES
match_number() as match_number,
FIRST(TWEET) AS FIRST_TWEET,
LAST(TWEET) AS LAST_TWEET,
FIRST(TWEETID) AS FIRST_TWEET_ID,
LAST(TWEETID) AS LAST_TWEET_ID,
COUNT(*) AS NUM_OF_TWEETS
ONE ROW PER MATCH
PATTERN (SIMILAR_before SIMILAR_after+)
DEFINE
SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
USERID |
NUM_OF_TWEETS |
FIRST_TWEET |
LAST_TWEET |
FIRST_TWEET_ID |
LAST_TWEET_ID |
elena |
2 |
aaa |
aaaa |
1 |
2 |
eva |
2 |
bbbb |
bbbbb |
4 |
5 |