MATCH_RECOGNIZE operator allows detection of patterns in relational data. Specifically, it allows us to identify records that mark the beginning of a set of records that together form a pattern. In this set, each record satisfies a certain condition. The fact that in a set of records these conditions are met– in a certain order– is proof of the pattern.
The condition a record adheres to can be as simple as: value in column is higher than value in previous record. And a pattern can then be as simple as: for at least 11 subsequent records is the value increasing. Or the condition is: “value in a certain column is equal to a constant” and the pattern equals “at least 5 and no more than 10 subsequent records all have that constant’s value”. And of course the conditions can be far more variant and the patterns much more complex. These patterns are defined using RegEx in
One crucial element among all pattern matching in relational data is that the data needs to be ordered in some specific order. This can be by time (very common) but also along other axes – such as distance from a certain position, altitude or price [range] or age [category].
To demonstrate the versatility of
MATCH_RECOGNIZE, we will use an example of courses taken during a CSE program. While some 300 and 400 level courses can be taken in any order, there is a preferred sequence recommended by the department to best prepare the students for the course. For e.g. before taking the CSE 472/572 – Statistical Machine Learning, it is recommended that Students take the following courses:
- STP 216 – Element of Statistics
- MAT 225 – Elementary Linear Algebra
- STP 418 – Introductory Applied Statistics
STP 216 and MAT 225 can be taken in any order, but STP 418 should be taken after STP 216.
We wanted to see how many students follow this recommendation, and how well they scored in the CSE 472 compared to students who did not take these courses, or only took partial set of courses.
If it was matter of students taking these course consecutively, then a simple lead lag SQL query might have worked nicely. However that is not the reality. Students take these courses interleaved with other courses, and not necessarily in the above order. This is where the power of Regex in
MATCH_RECOGNIZE will help us.
To analyze data from 1000s of students with 100s of courses, we ended up using
MATCH_RECOGNIZE to establish the pattern and identify the students that followed the recommended order of courses.
select STUDENT_ID from scratch.saqib_ali.COURSE match_recognize( partition by STUDENT_ID order by COURSE_SEQ FOLLOWED_RECOMMENDATION ONE ROW PER MATCH PATTERN (OTHER* (STP216|MAT225) OTHER* (STP216|MAT225) OTHER* (STP418) OTHER* CSE472) DEFINE STP216 as iff(COURSE='STP 216', TRUE, FALSE), MAT225 as iff(COURSE='MAT 225', TRUE, FALSE), STP418 as iff(COURSE='STP 418', TRUE, FALSE), STP472 as iff(COURSE='CSE 472', TRUE, FALSE), OTHER as iff(COURSE LIKE '%', TRUE, FALSE) );