MATCH_RECOGNIZE for performing A/B analysis on streaming data


Earlier this year our non-profit ran a plant a tree campaign. We had an awesome response. We made two different campaign pages to see which was more effective in converting the website visit to a donation. We wanted to make the A/B analysis real-time in order to improve the campaign descriptions to drive more donations.

The web session data was streamed to Snowflake. Note: web visitor is assigned a Session ID.

The data looks as following (PII fields are redacted):

We had two campaign pages designed for the A/B analysis. The conversion success criteria was

VISIT -> CAMPAIGN_PAGE -> DONATION_PAGE.

The end goal of the A/B analysis was to see which CAMPAIGN page performed better i.e. plant_a_tree_campaign_a.html vs. plant_a_tree_campaign_b.html.

While we usually perform these types of Analysis using Knime, we wanted to see if we can do this in Snowflake in real-time. For this purpose I used the MATCH_RECOGNIZE query in Snowflake.

Here is the query we ended up using. It is fairly easy to understand and easily extendible for other use cases with more than 3 conversion funnel steps.

select session_id, donated
from webstream
match_recognize(
  partition by session_id
  order by visit_timestamp
  measures
    iff(count(*)=3, TRUE, FALSE) as donated
  ONE ROW PER MATCH
  PATTERN (VISIT READ DONATE)
  DEFINE
    VISIT as iff(URI='index.html', TRUE, FALSE),
    READ as iff(URI='plant_a_tree_campaign_a.html', TRUE,     FALSE),
    DONATE as iff(URI='donate.html', TRUE, FALSE)
);

MATCH_RECOGNIZE Query Structure

  1. We partitioned the web data by the SESSION_ID, and ORDERed it by the visit_timestamp
  2. We defined the VISIT as the value of index.html in URI.
  3. READ as the value of the campaign page. plant_a_tree_campaign_a.html OR plant_a_tree_campaign_b.html
  4. we defined  DONATE as the visit to “donate.html”
  5. Finally we wanted to match the PATTERN VISIT -> READ (campaign page) -> DONATE

As you can see the query explains itself.

Since we were comparing plant_a_tree_campaign_a.html vs. plant_a_tree_campaign_b.html, we just had to replace that in the READ definition:

select session_id, donated
from webstream
match_recognize(
  partition by session_id
  order by visit_timestamp
  measures
    iff(count(*)=3, TRUE, FALSE) as donated
  ONE ROW PER MATCH
  PATTERN (VISIT READ DONATE)
  DEFINE
    VISIT as iff(URI='index.html', TRUE, FALSE),
    READ as iff(URI='plant_a_tree_campaign_b.html', TRUE,     FALSE),
    DONATE as iff(URI='donate.html', TRUE, FALSE)
);

Now ran these queries on 2 mins worth of streaming data.

We see that we got 4 donations in 2 mins for 5 visitors that saw plant_a_tree_campaign_a.html

whereas we only got 1 donation in 2 mins for 5 visitors that saw plant_a_tree_campaign_b.html

We were able to determine that plant_a_tree_campaign_b.html was not performing that well, and updated it accordingly, and kept performing the A/B analysis on the streaming data.

MATCH_RECOGNIZE was instrumental in performing the real-time A/B analysis of the campaign descriptions and improving them as needed.