Snowflake MATCH_RECOGNIZE for finding users that are posting similar (not exactly same) posts in a row


One of my current Data Mining course project requires use of Twitter accounts that post similar (not exactly same) Tweets in a row. For this I had to identify Twitter accounts that that are posting similar (not exactly same) tweets in a row.

I decided to use the Snowflake’s versatile MATCH_RECOGNIZE for this.

Let’s start with some sample data:

Tweet data for Eva, Amy and Elena

Based on this, we need to identify Elena and Eva as they posting similar tweets in a row. Whereas Amy is posting unique tweets.

I used the following SQL for that:

ELECT 
  USERID,
  REPEATING_TWEET
FROM SCRATCH.SAQIB_ALI.TWEETS
MATCH_RECOGNIZE(
  PARTITION BY USERID
  ORDER BY TWEETID ASC
  MEASURES
    FIRST(TWEET) AS REPEATING_TWEET    
  ONE ROW PER MATCH
  PATTERN (SIMILAR+)
  DEFINE
    SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
    
);
MATCH_RECOGNIZE Query