I've been interested in regular expressions (regex) for as long as I am coding. My initial encounter was during my first programming project: a simple web crawler in Perl. The task of extracting specific information from HTML led me to discover regex. The concept of matching patterns across diverse webpages seemed both magical and elegant.
Since then, my fascination has only grown. I've dug into the mechanics of regex parsing engines, their lesser-known features, and their limitations. This journey has earned me profound respect for those who have crafted these complex engines.
A few years ago, as I was graduating from university, I joined a consulting firm as a junior software engineer. My department provided consulting services and solutions to companies using the SWIFT banking network, including software like SWIFT Alliance Access (SAA) and Alliance Messaging Hub (AMH). These applications feature Business Rules Management Software (BRMS), where users write rules to filter and route financial messages.
My responsibility was to improve a tool called the Alliance Routing Tool (ART), which introduced version control for business rules and allowed users to simulate rule behavior against mock financial messages. This tool became integral to our clients' operations, allowing them to validate changes before deploying them to SAA or AMH.
Rules typically looked like this:
(field1 = [regex] OR field2 = [regex]) AND field3 > [number]
These systems managed thousands of such rules. A single change could unexpectedly affect others, risking disruptions in banking operations.
To mitigate these risks, I explored ways to predict the impact of changes to the rules. Consider a set of rules like:
R1: field1 = 1 AND field2 = "1[0-9][0-9]"
R2: field1 = 2 AND field2 = "103"
R3: field1 = 2 AND field2 = "10[0-9]"
R4: field1 = 2 AND field2 = "1[0-9][0-9]"
R5: field1 = 1 AND field2 = "103"
If modifications are made to R1 and R2, such as:
R1: field1 = 1 AND field2 = "109"
R2: field1 = 2 AND field2 = "10[4-9]"
R3: field1 = 2 AND field2 = "10[0-9]"
R4: field1 = 2 AND field2 = "1[0-9][0-9]"
R5: field1 = 1 AND field2 = "103"
What would be the impact on all the rules?
To answer this question we need to compute the set of messages captures by each rule before and after the modification:
before(R1): field1 = 1 AND field2 = "1[0-9][0-9]"
after (R1): field1 = 1 AND field2 = "109"
before(R2): field1 = 2 AND field2 = "103"
after (R2): field1 = 2 AND field2 = "10[4-9]"
before(R3): field1 = 2 AND field2 = "10[0-24-9]"
after (R3): field1 = 2 AND field2 = "10[0-3]"
before(R4): field1 = 2 AND field2 = "1[1-9][0-9]"
after (R4): field1 = 2 AND field2 = "1[1-9][0-9]"
before(R5):
after (R5): field1 = 1 AND field2 = "103"
By subtracting before - after
we get the sets of messages that are no longer matched and after - before
we get the messages that are newly matched for each rule.
With this information the user is able to assess if his modifications have the expected impact.
However, manipulating regex is complex, we need first to convert them to finite automata, perform the operation, then convert the result back to a regular expression pattern. Subtraction for instance is a very computing intensive operation as it requires to determinize the automaton - a task with exponential complexity.
The regex .*abc as an automaton
The same automaton after being determinized
I used a Java library, dk.brics.automaton, which among other features can convert regex patterns into automata and performs the subtraction. Converting the results back to a simple regex pattern, however, remained a challenge, as most algorithms prioritize validity over readability.
It was working great on small sets of rules as long as the regex used were not too complicated. Still handling more complex regex in a large number of rules was completely impossible.
The number of features that would be possible in business rules management software if this technical challenge was lifted.
One year ago I start thinking about this problem again, especially the aspect of manipulating automata before converting them back to a pattern. I gave myself the challenge to implement a tool that could manipulate regex in the most efficient way possible and output compact and simple regular expression.
After more than a year of hardwork I was able to implement a program that can perform operations such as intersection, union and subtraction between multiple regex and returning simple and compact patterns at a speed never reach on any project I have seen online.
So far I do not have a lot of users and I am sure that I can still greatly improve my implementation. If you are interested in testing it and giving me feedback I would greatly appreciate.
My dream is that one day my tool will be able to completely change the way business rules management software are designed. There is still a long way to go.
My project is available as a free API for Java, Node.js and Python at the following address: https://regexsolver.com/, you can also try the online demo to experiment directly in your web browser: https://regexsolver.com/demo.