In RegexSolver, I needed an efficient way to represent automaton transition conditions in memory. This required a solution for manipulating and storing sets of characters extracted from regular expressions.
To address this, I created irange, a Rust library that I discuss further in this article.
While irange allowed me to efficiently store and manipulate integers, not all integers are valid UTF-8 characters. For example, the UTF-8 standard marks all characters between 0xD800 and 0xDFFF as invalid, as they are reserved for surrogate pairs in UTF-16.
To overcome this limitation, I extended irange to handle regular expression character classes. Here are the requirements for my new library:
The first two points were already addressed by irange, so my focus was on meeting the last two requirements.
The UTF-8 standard specifies that valid characters range from 0x0000 to 0x10FFFF, with the range 0xD800 to 0xDFFF reserved for surrogate pairs, making them invalid in UTF-8.
Rust’s char type already takes these restrictions into account, so it was a natural choice for building the solution. But to be able to use char in irange they need to implement the Add, Sub, and AddAssign traits. Unfortunately, char does not implement these traits, and Rust doesn’t allow us to implement traits for types outside the current package.
As a result, I created a new struct based on char and implemented the necessary traits:
static INVALID_MIN: u32 = 0xD800;
static INVALID_MAX: u32 = 0xDFFF;
static INVALID_SIZE: u32 = 0x800;
pub struct Char(char);
impl Add<Char> for Char {
type Output = Char;
fn add(self, rhs: Self) -> Self::Output {
let mut sum = self.0 as u32 + rhs.0 as u32;
if sum >= INVALID_MIN && sum <= INVALID_MAX {
sum = INVALID_MAX + 1 + sum - INVALID_MIN;
}
if let Some(new_char) = char::from_u32(sum) {
Char(new_char)
} else {
panic!("attempt to add with overflow");
}
}
}
impl Sub<Char> for Char {
type Output = Char;
fn sub(self, rhs: Self) -> Self::Output {
let mut minuhend = self.0 as u32;
if minuhend >= INVALID_MIN {
minuhend -= INVALID_SIZE;
}
let mut subtrahend = rhs.0 as u32;
if subtrahend >= INVALID_MIN {
subtrahend -= INVALID_SIZE;
}
let mut sub = minuhend - subtrahend;
if sub >= INVALID_MIN {
sub += INVALID_SIZE;
}
if let Some(new_char) = char::from_u32(sub) {
Char(new_char)
} else {
panic!("attempt to sub with overflow");
}
}
}
impl AddAssign for Char {
fn add_assign(&mut self, rhs: Self) {
self.0 = (*self + rhs).0;
}
}
impl Bounded for Char {
fn min_value() -> Self {
Char('\0')
}
fn max_value() -> Self {
Char(char::MAX)
}
fn one() -> Self {
Char('\u{1}')
}
}
In order to handle addition and subtraction of char, I implemented logic to "jump" over the forbidden character ranges.
The Bounded trait is required by irange to define the minimum and maximum values, as well as the "stepping behavior" when traversing values of a type.
Regular expressions support various ways to represent character classes:
[A-F] - A range of characters included in the class[^A-F] - A range of characters excluded from the class\w - Perl-style predefined character classes\p{ASCII_HEX_DIGIT} - Unicode named character classes\P{ASCII_HEX_DIGIT} - Exclusion of named character classesConverting a character range to a corresponding regular expression class is straightforward, but handling the more complex cases (such as predefined and named classes) requires identifying which class the range belongs to.
Fortunately, I didn’t need to hardcode these mappings. By using ucd-generate, I could generate all the necessary ranges for each character class:
ucd-generate general-category /tmp/ucd-16.0.0 --chars --exclude surrogate > src/tokens/unicode/general_category.rs
ucd-generate general-category /tmp/ucd-16.0.0 --chars --include decimalnumber > src/tokens/unicode/perl_decimal.rs
ucd-generate property-bool /tmp/ucd-16.0.0 --chars --include whitespace > src/tokens/unicode/perl_space.rs
ucd-generate perl-word /tmp/ucd-16.0.0 --chars > src/tokens/unicode/perl_word.rs
ucd-generate property-bool /tmp/ucd-16.0.0 --chars > src/tokens/unicode/property_bool.rs
ucd-generate script /tmp/ucd-16.0.0 --chars > src/tokens/unicode/script.rs
By comparing character ranges with the generated Unicode classes, I was able to easily map ranges to their corresponding valid regular expression classes.
Here is an example of what you can do with regex-charclass:
use regex_charclass::{irange::{RangeSet, range::AnyRange}, char::Char, CharacterClass};
let range1 = RangeSet::new_from_range_char('a'..='z');
assert_eq!(26, range1.get_cardinality());
assert_eq!("[a-z]", range1.to_regex());
let range2 = RangeSet::new_from_ranges(&[
AnyRange::from(Char::new('0')..=Char::new('9')),
AnyRange::from(Char::new('A')..=Char::new('F')),
AnyRange::from(Char::new('a')..=Char::new('f')),
]);
assert_eq!("\\p{ASCII_Hex_Digit}", range2.to_regex());
let range2_complement = range2.complement();
assert_eq!("\\P{ASCII_Hex_Digit}", range2_complement.to_regex());
assert_eq!(".", range2.union(&range2_complement).to_regex());
assert_eq!("[]", range2.intersection(&range2_complement).to_regex());
assert_eq!("[g-z]", range1.difference(&range2).to_regex());
If you’re interested in using regex-charclass in your project, you can find it on crates.io and check out the full documentation here.
regex-charclass is an open-source library, and I welcome contributions! If you’d like to help out, visit the GitHub repository. I look forward to hearing your feedback and seeing your contributions.