Snowflake JAROWINKLER_SIMILARITY vs EDITDISTANCE for finding similar tweets


As part of my Data Mining class group project, I was tasked to identify (in near real-time) Twitter accounts that were posting similar (not exact same) tweets over and over. I ended up using Snowflake’s string similarity functions to determine the similarity between the tweets. Here is short overview of them.

Snowflake also has two String similarity functions— JAROWINKLER_SIMILARITY and EDITDISTANCE. Both of them try to quantify the differences between two strings, phrases, or sentences.

JAROWINKLER_SIMILARITY vs EDITDISTANCE

EDITDISTANCE aka Levenshtein counts the number of edits (insertions, deletions, or substitutions) needed to convert one string to the other. Whereas the Jaro-Winkler algorithm is a measure of characters in common with the consideration that differences near the start of the string are more significant than differences near the end of the string.

Let’s experiment with the following two tweets:
Awesome new dish soap that gets the dishes sparkling clean
Awesome new dish soap that m
akes the dishes sparkle

JAROWINKLER_SIMILARITY
select JAROWINKLER_SIMILARITY('Awesome new dish soap that gets the dishes sparkling clean', 'Awesome new dish soap that makes the dishes sparkle');

This will produce a Jaro-Winkler similarity score of 92.

EDITDISTANCE

Since EDITDISTANCE is a count, we need to normalize it to get a score:

select 1-EDITDISTANCE('awesome new dish soap that gets the dishes sparkling clean', 'awesome new dish soap that makes the dishes sparkle')
/ greatest(length('awesome new dish soap that gets the dishes sparkling clean'), length('awesome new dish soap that makes the dishes sparkle'));

This will produce a Levenshtein similarity score of 79%.

Since I was looking at tweets which usually start with the same structure but are slightly modified midway or towards the end, I ended up using JAROWINKLER_SIMILARITY

The selection of the right similarity measure will depend on the application. But in general, Levenshtein distance works better on longer phrases and sentences. Jaro-Winkler works better for phrases that start the same way.

Analyzing streaming data (tweets)

The end-goal for my part of the project was to identify Twitter accounts that were posting similar tweet over and over. I used Snowflake MATCH_RECOGNIZE for this purpose. Snowflake MATCH_RECOGNIZE clause is used to search for a pattern in a single field spanning over multiple records. This clause enables you to define patterns using regular expressions and aggregate methods to verify and extract values from the match.

SELECT 
 USERID
  , NUM_OF_TWEETS
  , FIRST_TWEET
  , LAST_TWEET
  , FIRST_TWEET_ID
  , LAST_TWEET_ID
FROM SCRATCH.SAQIB_ALI.TWEETS
MATCH_RECOGNIZE(
  PARTITION BY USERID
  ORDER BY TWEETID ASC
  MEASURES
    match_number() as match_number,
    FIRST(TWEET) AS FIRST_TWEET,
    LAST(TWEET) AS LAST_TWEET,
    FIRST(TWEETID) AS FIRST_TWEET_ID,
    LAST(TWEETID) AS LAST_TWEET_ID,
    COUNT(*) AS NUM_OF_TWEETS
  ONE ROW PER MATCH
  PATTERN (SIMILAR_before SIMILAR_after+)
  DEFINE
    SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
    SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
Use oif JAROWINKLER_SIMILARITY and MATCH_RECOGNIZE to identify users that are posting similar tweets