alexvbrdn

regex-charclass: A Rust Library for Manipulating Regular Expression Character Classes

2024-11-17 Dev

In RegexSolver, I needed an efficient way to represent automaton transition conditions in memory. This required a solution for manipulating and storing sets of characters extracted from regular expressions.

To address this, I created irange, a Rust library that I discuss further in this article.

While irange allowed me to efficiently store and manipulate integers, not all integers are valid UTF-8 characters. For example, the UTF-8 standard marks all characters between 0xD800 and 0xDFFF as invalid, as they are reserved for surrogate pairs in UTF-16.

To overcome this limitation, I extended irange to handle regular expression character classes. Here are the requirements for my new library:

  1. Ability to perform operations like intersection, union, and complement
  2. Memory efficiency
  3. Support only valid UTF-8 characters
  4. Can be converted to valid regular expression character classes

The first two points were already addressed by irange, so my focus was on meeting the last two requirements.

Ensuring Only Valid UTF-8 Characters

The UTF-8 standard specifies that valid characters range from 0x0000 to 0x10FFFF, with the range 0xD800 to 0xDFFF reserved for surrogate pairs, making them invalid in UTF-8.

Rust’s char type already takes these restrictions into account, so it was a natural choice for building the solution. But to be able to use char in irange they need to implement the Add, Sub, and AddAssign traits. Unfortunately, char does not implement these traits, and Rust doesn’t allow us to implement traits for types outside the current package.

As a result, I created a new struct based on char and implemented the necessary traits:

static INVALID_MIN: u32 = 0xD800;
static INVALID_MAX: u32 = 0xDFFF;
static INVALID_SIZE: u32 = 0x800;

pub struct Char(char);

impl Add<Char> for Char {
    type Output = Char;

    fn add(self, rhs: Self) -> Self::Output {
        let mut sum = self.0 as u32 + rhs.0 as u32;
        if sum >= INVALID_MIN && sum <= INVALID_MAX {
            sum = INVALID_MAX + 1 + sum - INVALID_MIN;
        }
        if let Some(new_char) = char::from_u32(sum) {
            Char(new_char)
        } else {
            panic!("attempt to add with overflow");
        }
    }
}

impl Sub<Char> for Char {
    type Output = Char;

    fn sub(self, rhs: Self) -> Self::Output {
        let mut minuhend = self.0 as u32;
        if minuhend >= INVALID_MIN {
            minuhend -= INVALID_SIZE;
        }
        let mut subtrahend = rhs.0 as u32;
        if subtrahend >= INVALID_MIN {
            subtrahend -= INVALID_SIZE;
        }
        let mut sub = minuhend - subtrahend;
        if sub >= INVALID_MIN {
            sub += INVALID_SIZE;
        }
        if let Some(new_char) = char::from_u32(sub) {
            Char(new_char)
        } else {
            panic!("attempt to sub with overflow");
        }
    }
}

impl AddAssign for Char {
    fn add_assign(&mut self, rhs: Self) {
        self.0 = (*self + rhs).0;
    }
}

impl Bounded for Char {
    fn min_value() -> Self {
        Char('\0')
    }

    fn max_value() -> Self {
        Char(char::MAX)
    }

    fn one() -> Self {
        Char('\u{1}')
    }
}

In order to handle addition and subtraction of char, I implemented logic to "jump" over the forbidden character ranges.

The Bounded trait is required by irange to define the minimum and maximum values, as well as the "stepping behavior" when traversing values of a type.

Converting to Valid Regular Expression Character Classes

Regular expressions support various ways to represent character classes:

Converting a character range to a corresponding regular expression class is straightforward, but handling the more complex cases (such as predefined and named classes) requires identifying which class the range belongs to.

Fortunately, I didn’t need to hardcode these mappings. By using ucd-generate, I could generate all the necessary ranges for each character class:

ucd-generate general-category /tmp/ucd-16.0.0 --chars --exclude surrogate > src/tokens/unicode/general_category.rs
ucd-generate general-category /tmp/ucd-16.0.0 --chars --include decimalnumber > src/tokens/unicode/perl_decimal.rs
ucd-generate property-bool /tmp/ucd-16.0.0 --chars --include whitespace > src/tokens/unicode/perl_space.rs
ucd-generate perl-word /tmp/ucd-16.0.0 --chars > src/tokens/unicode/perl_word.rs
ucd-generate property-bool /tmp/ucd-16.0.0 --chars > src/tokens/unicode/property_bool.rs
ucd-generate script /tmp/ucd-16.0.0 --chars > src/tokens/unicode/script.rs

By comparing character ranges with the generated Unicode classes, I was able to easily map ranges to their corresponding valid regular expression classes.

Example

Here is an example of what you can do with regex-charclass:

use regex_charclass::{irange::{RangeSet, range::AnyRange}, char::Char, CharacterClass};

let range1 = RangeSet::new_from_range_char('a'..='z');
assert_eq!(26, range1.get_cardinality());
assert_eq!("[a-z]", range1.to_regex());

let range2 = RangeSet::new_from_ranges(&[
    AnyRange::from(Char::new('0')..=Char::new('9')),
    AnyRange::from(Char::new('A')..=Char::new('F')),
    AnyRange::from(Char::new('a')..=Char::new('f')),
]);
assert_eq!("\\p{ASCII_Hex_Digit}", range2.to_regex());

let range2_complement = range2.complement();
assert_eq!("\\P{ASCII_Hex_Digit}", range2_complement.to_regex());

assert_eq!(".", range2.union(&range2_complement).to_regex());
assert_eq!("[]", range2.intersection(&range2_complement).to_regex());

assert_eq!("[g-z]", range1.difference(&range2).to_regex());

Get Started

If you’re interested in using regex-charclass in your project, you can find it on crates.io and check out the full documentation here.

Contribute

regex-charclass is an open-source library, and I welcome contributions! If you’d like to help out, visit the GitHub repository. I look forward to hearing your feedback and seeing your contributions.