In RegexSolver, I needed an efficient way to represent automaton transition conditions in memory. This required a solution for manipulating and storing sets of characters extracted from regular expressions.
To address this, I created irange, a Rust library that I discuss further in this article.
While irange allowed me to efficiently store and manipulate integers, not all integers are valid UTF-8 characters. For example, the UTF-8 standard marks all characters between 0xD800
and 0xDFFF
as invalid, as they are reserved for surrogate pairs in UTF-16.
To overcome this limitation, I extended irange to handle regular expression character classes. Here are the requirements for my new library:
The first two points were already addressed by irange, so my focus was on meeting the last two requirements.
The UTF-8 standard specifies that valid characters range from 0x0000
to 0x10FFFF
, with the range 0xD800
to 0xDFFF
reserved for surrogate pairs, making them invalid in UTF-8.
Rust’s char
type already takes these restrictions into account, so it was a natural choice for building the solution. But to be able to use char
in irange
they need to implement the Add
, Sub
, and AddAssign
traits. Unfortunately, char does not implement these traits, and Rust doesn’t allow us to implement traits for types outside the current package.
As a result, I created a new struct based on char and implemented the necessary traits:
static INVALID_MIN: u32 = 0xD800;
static INVALID_MAX: u32 = 0xDFFF;
static INVALID_SIZE: u32 = 0x800;
pub struct Char(char);
impl Add<Char> for Char {
type Output = Char;
fn add(self, rhs: Self) -> Self::Output {
let mut sum = self.0 as u32 + rhs.0 as u32;
if sum >= INVALID_MIN && sum <= INVALID_MAX {
sum = INVALID_MAX + 1 + sum - INVALID_MIN;
}
if let Some(new_char) = char::from_u32(sum) {
Char(new_char)
} else {
panic!("attempt to add with overflow");
}
}
}
impl Sub<Char> for Char {
type Output = Char;
fn sub(self, rhs: Self) -> Self::Output {
let mut minuhend = self.0 as u32;
if minuhend >= INVALID_MIN {
minuhend -= INVALID_SIZE;
}
let mut subtrahend = rhs.0 as u32;
if subtrahend >= INVALID_MIN {
subtrahend -= INVALID_SIZE;
}
let mut sub = minuhend - subtrahend;
if sub >= INVALID_MIN {
sub += INVALID_SIZE;
}
if let Some(new_char) = char::from_u32(sub) {
Char(new_char)
} else {
panic!("attempt to sub with overflow");
}
}
}
impl AddAssign for Char {
fn add_assign(&mut self, rhs: Self) {
self.0 = (*self + rhs).0;
}
}
impl Bounded for Char {
fn min_value() -> Self {
Char('\0')
}
fn max_value() -> Self {
Char(char::MAX)
}
fn one() -> Self {
Char('\u{1}')
}
}
In order to handle addition and subtraction of char
, I implemented logic to "jump" over the forbidden character ranges.
The Bounded
trait is required by irange
to define the minimum and maximum values, as well as the "stepping behavior" when traversing values of a type.
Regular expressions support various ways to represent character classes:
[A-F]
- A range of characters included in the class[^A-F]
- A range of characters excluded from the class\w
- Perl-style predefined character classes\p{ASCII_HEX_DIGIT}
- Unicode named character classes\P{ASCII_HEX_DIGIT}
- Exclusion of named character classesConverting a character range to a corresponding regular expression class is straightforward, but handling the more complex cases (such as predefined and named classes) requires identifying which class the range belongs to.
Fortunately, I didn’t need to hardcode these mappings. By using ucd-generate
, I could generate all the necessary ranges for each character class:
ucd-generate general-category /tmp/ucd-16.0.0 --chars --exclude surrogate > src/tokens/unicode/general_category.rs
ucd-generate general-category /tmp/ucd-16.0.0 --chars --include decimalnumber > src/tokens/unicode/perl_decimal.rs
ucd-generate property-bool /tmp/ucd-16.0.0 --chars --include whitespace > src/tokens/unicode/perl_space.rs
ucd-generate perl-word /tmp/ucd-16.0.0 --chars > src/tokens/unicode/perl_word.rs
ucd-generate property-bool /tmp/ucd-16.0.0 --chars > src/tokens/unicode/property_bool.rs
ucd-generate script /tmp/ucd-16.0.0 --chars > src/tokens/unicode/script.rs
By comparing character ranges with the generated Unicode classes, I was able to easily map ranges to their corresponding valid regular expression classes.
Here is an example of what you can do with regex-charclass
:
use regex_charclass::{irange::{RangeSet, range::AnyRange}, char::Char, CharacterClass};
let range1 = RangeSet::new_from_range_char('a'..='z');
assert_eq!(26, range1.get_cardinality());
assert_eq!("[a-z]", range1.to_regex());
let range2 = RangeSet::new_from_ranges(&[
AnyRange::from(Char::new('0')..=Char::new('9')),
AnyRange::from(Char::new('A')..=Char::new('F')),
AnyRange::from(Char::new('a')..=Char::new('f')),
]);
assert_eq!("\\p{ASCII_Hex_Digit}", range2.to_regex());
let range2_complement = range2.complement();
assert_eq!("\\P{ASCII_Hex_Digit}", range2_complement.to_regex());
assert_eq!(".", range2.union(&range2_complement).to_regex());
assert_eq!("[]", range2.intersection(&range2_complement).to_regex());
assert_eq!("[g-z]", range1.difference(&range2).to_regex());
If you’re interested in using regex-charclass
in your project, you can find it on crates.io and check out the full documentation here.
regex-charclass
is an open-source library, and I welcome contributions! If you’d like to help out, visit the GitHub repository. I look forward to hearing your feedback and seeing your contributions.