As part of my Data Mining class group project, I was tasked to identify (in near real-time) Twitter accounts that were posting similar (not exact same) tweets over and over. I ended up using Snowflake’s string similarity functions to determine the similarity between the tweets. Here is short overview of them.
Snowflake also has two String similarity functions—
EDITDISTANCE. Both of them try to quantify the differences between two strings, phrases, or sentences.
JAROWINKLER_SIMILARITY vs EDITDISTANCE
EDITDISTANCE aka Levenshtein counts the number of edits (insertions, deletions, or substitutions) needed to convert one string to the other. Whereas the Jaro-Winkler algorithm is a measure of characters in common with the consideration that differences near the start of the string are more significant than differences near the end of the string.
Let’s experiment with the following two tweets:
Awesome new dish soap that gets the dishes sparkling clean
Awesome new dish soap that makes the dishes sparkle
select JAROWINKLER_SIMILARITY('Awesome new dish soap that gets the dishes sparkling clean', 'Awesome new dish soap that makes the dishes sparkle');
This will produce a Jaro-Winkler similarity score of
Since EDITDISTANCE is a count, we need to normalize it to get a score:
select 1-EDITDISTANCE('awesome new dish soap that gets the dishes sparkling clean', 'awesome new dish soap that makes the dishes sparkle') / greatest(length('awesome new dish soap that gets the dishes sparkling clean'), length('awesome new dish soap that makes the dishes sparkle'));
This will produce a Levenshtein similarity score of 79%.
Since I was looking at tweets which usually start with the same structure but are slightly modified midway or towards the end, I ended up using
The selection of the right similarity measure will depend on the application. But in general, Levenshtein distance works better on longer phrases and sentences. Jaro-Winkler works better for phrases that start the same way.
Analyzing streaming data (tweets)
The end-goal for my part of the project was to identify Twitter accounts that were posting similar tweet over and over. I used Snowflake MATCH_RECOGNIZE for this purpose. Snowflake
MATCH_RECOGNIZE clause is used to search for a pattern in a single field spanning over multiple records. This clause enables you to define patterns using regular expressions and aggregate methods to verify and extract values from the match.
SELECT USERID , NUM_OF_TWEETS , FIRST_TWEET , LAST_TWEET , FIRST_TWEET_ID , LAST_TWEET_ID FROM SCRATCH.SAQIB_ALI.TWEETS MATCH_RECOGNIZE( PARTITION BY USERID ORDER BY TWEETID ASC MEASURES match_number() as match_number, FIRST(TWEET) AS FIRST_TWEET, LAST(TWEET) AS LAST_TWEET, FIRST(TWEETID) AS FIRST_TWEET_ID, LAST(TWEETID) AS LAST_TWEET_ID, COUNT(*) AS NUM_OF_TWEETS ONE ROW PER MATCH PATTERN (SIMILAR_before SIMILAR_after+) DEFINE SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90, SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90 );