One of my current Data Mining course projects requires identifying Twitter accounts that post similar (but not exactly the same) tweets in a row.
I decided to use Snowflake’s versatile MATCH_RECOGNIZE clause for this.
Let’s start with some sample data:
Based on this, we need to identify Elena and Eva, as they are posting similar tweets in a row, whereas Amy is posting unique tweets.
I used the following SQL for that:
SELECT USERID, REPEATING_TWEET
FROM SCRATCH.SAQIB_ALI.TWEETS
MATCH_RECOGNIZE(
    PARTITION BY USERID          -- evaluate each account separately
    ORDER BY TWEETID ASC         -- compare tweets in posting order
    MEASURES FIRST(TWEET) AS REPEATING_TWEET
    ONE ROW PER MATCH
    PATTERN (SIMILAR+)           -- one or more consecutive similar tweets
    DEFINE SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
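To sanity-check the threshold of 90, it can help to look at the raw similarity scores before pattern matching. The query below is a minimal sketch, assuming the same TWEETS table and columns as the query above; the SIMILARITY_TO_PREVIOUS alias is just an illustrative name I picked.

```sql
-- Sketch: show each tweet next to its Jaro-Winkler similarity (0 to 100)
-- with the previous tweet from the same account.
-- SIMILARITY_TO_PREVIOUS is an illustrative alias, not part of the original query.
SELECT USERID,
       TWEETID,
       TWEET,
       JAROWINKLER_SIMILARITY(
           TWEET,
           LAG(TWEET) OVER (PARTITION BY USERID ORDER BY TWEETID ASC)
       ) AS SIMILARITY_TO_PREVIOUS
FROM SCRATCH.SAQIB_ALI.TWEETS
ORDER BY USERID, TWEETID;
```

The first tweet of each account has no predecessor, so its score is NULL. Rows with a score above 90 are the ones the SIMILAR pattern variable would match.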