regexクレートを歩く

概要

regexクレートはRustで書かれた正規表現ライブラリで高速なgrepとして有名なripgrepなどでも使用されている。今回はregexクレートを使用してRegex構造体を作成したときに内部ではどのようなことが行われているのかを見ていったのだが、色々こんがらがって来たのでここにメモを残すこととした。詳細にはあまり立ち入れていないがそれは今後の課題としたい。

以下はRust1.31で正規表現を使用する場合のサンプルコード。なおregexクレートのバージョンはv1.1.0である。

use regex::Regex;
fn main() {
    let re = Regex::new("HappyNewYear!*").unwrap();
}

このコードをエントリポイントとして、ソースコードを読んでいく。

Regexのbuild

サンプルコードで生成しているRegex構造体はregexクレートのsrc/re_unicode.rs内にて以下のように定義されている。

#[derive(Clone)]
pub struct Regex(pub Exec);

/// Core regular expression methods.
impl Regex {
    /// Compiles a regular expression. Once compiled, it can be used repeatedly
    /// to search, split or replace text in a string.
    ///
    /// If an invalid expression is given, then an error is returned.
    pub fn new(re: &str) -> Result<Regex, Error> {
        RegexBuilder::new(re).build()
    }
    /// 省略
}

Regex構造体はExec構造体をフィールドとして持つ。

pub struct Exec {
    /// All read only state.
    ro: Arc<ExecReadOnly>,
    /// Caches for the various matching engines.
    cache: CachedThreadLocal<ProgramCache>,
}

Exec構造体はフィールドとしてマッチングに必要な情報が格納されたExecReadOnly構造体と、マッチング時にキャッシュとして使用するための構造体を持っている。

ExecReadOnly構造体は以下のように定義されている。

#[derive(Debug)]
struct ExecReadOnly {
    /// The original regular expressions given by the caller to compile.
    res: Vec<String>,
    /// A compiled program that is used in the NFA simulation and backtracking.
    /// It can be byte-based or Unicode codepoint based.
    ///
    /// N.B. It is not possibly to make this byte-based from the public API.
    /// It is only used for testing byte based programs in the NFA simulations.
    nfa: Program,
    /// A compiled byte based program for DFA execution. This is only used
    /// if a DFA can be executed. (Currently, only word boundary assertions are
    /// not supported.) Note that this program contains an embedded `.*?`
    /// preceding the first capture group, unless the regex is anchored at the
    /// beginning.
    dfa: Program,
    /// The same as above, except the program is reversed (and there is no
    /// preceding `.*?`). This is used by the DFA to find the starting location
    /// of matches.
    dfa_reverse: Program,
    /// A set of suffix literals extracted from the regex.
    ///
    /// Prefix literals are stored on the `Program`, since they are used inside
    /// the matching engines.
    suffixes: LiteralSearcher,
    /// match_type encodes as much upfront knowledge about how we're going to
    /// execute a search as possible.
    match_type: MatchType,
}

ExecReadOnly構造体にはオリジナルの正規表現やnfaなどマッチング時に使用するVMコードと思しきフィールドが格納されているようである。ここではひとまず、引き続き処理を追う。

Regex::new()を呼ぶと内部でRegexBuilderに引数の正規表現が渡り、build()を呼び出している。

/// A configurable builder for a regular expression.
///
/// A builder can be used to configure how the regex is built, for example, by
/// setting the default flags (which can be overridden in the expression
/// itself) or setting various limits.
pub struct RegexBuilder(RegexOptions);

impl RegexBuilder {
    /// Create a new regular expression builder with the given pattern.
    ///
    /// If the pattern is invalid, then an error will be returned when
    /// `build` is called.
    pub fn new(pattern: &str) -> RegexBuilder {
        let mut builder = RegexBuilder(RegexOptions::default());
        builder.0.pats.push(pattern.to_owned());
        builder
    }

    /// Consume the builder and compile the regular expression.
    ///
    /// Note that calling `as_str` on the resulting `Regex` will produce the
    /// pattern given to `new` verbatim. Notably, it will not incorporate any
    /// of the flags set on this builder.
    pub fn build(&self) -> Result<Regex, Error> {
        ExecBuilder::new_options(self.0.clone())
            .only_utf8($only_utf8)
            .build()
            .map(Regex::from)
    }
    /// 省略
}

RegexBuilderはnewされるとRegexOptionsの設定を行っている。RegexOptionsでは複数行に対してマッチを行うかどうかや、任意の文字にマッチする「.」を改行にマッチさせるかなどを設定することができる。そしてbuild()を呼ぶとExecBuilderにRegexOptionsが引き渡されExec構造体が作成される。

Execのbuild

ExecBuilderのbuildメソッドは以下のようになっている。

impl ExecBuilder {
    /// Build an executor that can run a regular expression.
    pub fn build(self) -> Result<Exec, Error> {
        // Special case when we have no patterns to compile.
        // This can happen when compiling a regex set.
        if self.options.pats.is_empty() {
            let ro = Arc::new(ExecReadOnly {
                res: vec![],
                nfa: Program::new(),
                dfa: Program::new(),
                dfa_reverse: Program::new(),
                suffixes: LiteralSearcher::empty(),
                match_type: MatchType::Nothing,
            });
            return Ok(Exec {
                ro: ro,
                cache: CachedThreadLocal::new(),
            });
        }
        let parsed = self.parse()?;
        let mut nfa = Compiler::new()
            .size_limit(self.options.size_limit)
            .bytes(self.bytes || parsed.bytes)
            .only_utf8(self.only_utf8)
            .compile(&parsed.exprs)?;
        let mut dfa = Compiler::new()
            .size_limit(self.options.size_limit)
            .dfa(true)
            .only_utf8(self.only_utf8)
            .compile(&parsed.exprs)?;
        let mut dfa_reverse = Compiler::new()
            .size_limit(self.options.size_limit)
            .dfa(true)
            .only_utf8(self.only_utf8)
            .reverse(true)
            .compile(&parsed.exprs)?;

        let prefixes = parsed.prefixes.unambiguous_prefixes();
        let suffixes = parsed.suffixes.unambiguous_suffixes();
        nfa.prefixes = LiteralSearcher::prefixes(prefixes);
        dfa.prefixes = nfa.prefixes.clone();
        dfa.dfa_size_limit = self.options.dfa_size_limit;
        dfa_reverse.dfa_size_limit = self.options.dfa_size_limit;

        let mut ro = ExecReadOnly {
            res: self.options.pats,
            nfa: nfa,
            dfa: dfa,
            dfa_reverse: dfa_reverse,
            suffixes: LiteralSearcher::suffixes(suffixes),
            match_type: MatchType::Nothing,
        };
        ro.match_type = ro.choose_match_type(self.match_type);

        let ro = Arc::new(ro);
        Ok(Exec {
            ro: ro,
            cache: CachedThreadLocal::new(),
        })
    }
}

少し長いので複数に分けて上の方から順に見て行く。

正規表現パターンが与えられなかった場合の処理

        // Special case when we have no patterns to compile.
        // This can happen when compiling a regex set.
        if self.options.pats.is_empty() {
            let ro = Arc::new(ExecReadOnly {
                res: vec![],
                nfa: Program::new(),
                dfa: Program::new(),
                dfa_reverse: Program::new(),
                suffixes: LiteralSearcher::empty(),
                match_type: MatchType::Nothing,
            });
            return Ok(Exec {
                ro: ro,
                cache: CachedThreadLocal::new(),
            });
        }

ここは正規表現のパターンが与えられなかった場合の処理。 regexクレートにはRegexSetを使うことで複数の正規表現に対するマッチ結果を同時に得ることができるという機能があり、このときに正規表現が１つも与えられなかったときにこのifの内側に入るようだ。 ifの内側ではmatch_typeにどんな入力にもマッチしないことを表すMatchType::Nothingが入れられている。

正規表現のパースとコンパイル

        let parsed = self.parse()?;
        let mut nfa = Compiler::new()
            .size_limit(self.options.size_limit)
            .bytes(self.bytes || parsed.bytes)
            .only_utf8(self.only_utf8)
            .compile(&parsed.exprs)?;
        let mut dfa = Compiler::new()
            .size_limit(self.options.size_limit)
            .dfa(true)
            .only_utf8(self.only_utf8)
            .compile(&parsed.exprs)?;
        let mut dfa_reverse = Compiler::new()
            .size_limit(self.options.size_limit)
            .dfa(true)
            .only_utf8(self.only_utf8)
            .reverse(true)
            .compile(&parsed.exprs)?;

regexクレートでは正規表現をNFA, DFA, そして逆向きのDFAに事前にコンパイルして用意しているようだ。 regexクレートのマッチングは基本的にDFAで実行可能な場合はDFAで実行するようになっているが、キャプチャが必要な場合などにはnfaも使うような実装になっている。逆向きのDFAはDFAだけでマッチングを行ったときにマッチの開始位置を得るときなどに使われている。

Prefix、Suffixの取得とDFAサイズの上限の設定

        let prefixes = parsed.prefixes.unambiguous_prefixes();
        let suffixes = parsed.suffixes.unambiguous_suffixes();
        nfa.prefixes = LiteralSearcher::prefixes(prefixes);
        dfa.prefixes = nfa.prefixes.clone();
        dfa.dfa_size_limit = self.options.dfa_size_limit;
        dfa_reverse.dfa_size_limit = self.options.dfa_size_limit;

正規表現によるマッチングを高速化する手法として、正規表現の中に含まれる固定文字列をBoyer-Moore法などの高速検索手法を利用して高速化する手法がある。上記のコードにおけるprefixesとsuffixesはそれぞれ正規表現の先頭と末尾の固定文字列でありregexクレートではこれらの固定文字列を利用してマッチングの高速化を図っている。

dfa_size_limitは文字通りDFAのサイズ上限を表している。regexクレートではDFAマッチングにおいて、マッチングを行いながらDFAを構築していくonline DFAという方式を取っている。このとき一度構築された状態はキャッシュされ、再度利用される際に再構築する手間を省くようになっている。しかし、キャッシュされた状態の数がdfa_size_limitを超えた場合はキャッシュをリフレッシュするようになっている。また、キャッシュのリフレッシュが頻繁に発生する場合にはマッチングがNFAに切り替わるようになっている。

マッチタイプの選択

        let mut ro = ExecReadOnly {
            res: self.options.pats,
            nfa: nfa,
            dfa: dfa,
            dfa_reverse: dfa_reverse,
            suffixes: LiteralSearcher::suffixes(suffixes),
            match_type: MatchType::Nothing,
        };
        ro.match_type = ro.choose_match_type(self.match_type);

ここではマッチタイプの選択を行っている。マッチタイプの選択はchoose_match_typeというメソッドで行われているのでchoose_match_typeの内容を見てみる。

impl ExecReadOnly {
    fn choose_match_type(&self, hint: Option<MatchType>) -> MatchType {
        use self::MatchType::*;
        if let Some(Nfa(_)) = hint {
            return hint.unwrap();
        }
        // If the NFA is empty, then we'll never match anything.
        if self.nfa.insts.is_empty() {
            return Nothing;
        }
        // If our set of prefixes is complete, then we can use it to find
        // a match in lieu of a regex engine. This doesn't quite work well in
        // the presence of multiple regexes, so only do it when there's one.
        //
        // TODO(burntsushi): Also, don't try to match literals if the regex is
        // partially anchored. We could technically do it, but we'd need to
        // create two sets of literals: all of them and then the subset that
        // aren't anchored. We would then only search for all of them when at
        // the beginning of the input and use the subset in all other cases.
        if self.res.len() == 1 {
            if self.nfa.prefixes.complete() {
                return if self.nfa.is_anchored_start {
                    Literal(MatchLiteralType::AnchoredStart)
                } else {
                    Literal(MatchLiteralType::Unanchored)
                };
            }
            if self.suffixes.complete() {
                return if self.nfa.is_anchored_end {
                    Literal(MatchLiteralType::AnchoredEnd)
                } else {
                    // This case shouldn't happen. When the regex isn't
                    // anchored, then complete prefixes should imply complete
                    // suffixes.
                    Literal(MatchLiteralType::Unanchored)
                };
            }
        }
        // If we can execute the DFA, then we totally should.
        if dfa::can_exec(&self.dfa) {
            // Regex sets require a slightly specialized path.
            if self.res.len() >= 2 {
                return DfaMany;
            }
            // If the regex is anchored at the end but not the start, then
            // just match in reverse from the end of the haystack.
            if !self.nfa.is_anchored_start && self.nfa.is_anchored_end {
                return DfaAnchoredReverse;
            }
            // If there's a longish suffix literal, then it might be faster
            // to look for that first.
            if self.should_suffix_scan() {
                return DfaSuffix;
            }
            // Fall back to your garden variety forward searching lazy DFA.
            return Dfa;
        }
        // We're so totally hosed.
        Nfa(MatchNfaType::Auto)
    }
  }
}

これも長いので少しずつ見ていく。

マッチングタイプが指定されている場合と、NFAがからの場合

        use self::MatchType::*;
        if let Some(Nfa(_)) = hint {
            return hint.unwrap();
        }
        // If the NFA is empty, then we'll never match anything.
        if self.nfa.insts.is_empty() {
            return Nothing;
        }

hintとしてmatch_typeの指定(NFAのみ)があればそれを返し、nfaがからの場合はNothingを返すようになっている。

固定文字列検索だけでマッチングが行える場合?

        // If our set of prefixes is complete, then we can use it to find
        // a match in lieu of a regex engine. This doesn't quite work well in
        // the presence of multiple regexes, so only do it when there's one.
        //
        // TODO(burntsushi): Also, don't try to match literals if the regex is
        // partially anchored. We could technically do it, but we'd need to
        // create two sets of literals: all of them and then the subset that
        // aren't anchored. We would then only search for all of them when at
        // the beginning of the input and use the subset in all other cases.
        if self.res.len() == 1 {
            if self.nfa.prefixes.complete() {
                return if self.nfa.is_anchored_start {
                    Literal(MatchLiteralType::AnchoredStart)
                } else {
                    Literal(MatchLiteralType::Unanchored)
                };
            }
            if self.suffixes.complete() {
                return if self.nfa.is_anchored_end {
                    Literal(MatchLiteralType::AnchoredEnd)
                } else {
                    // This case shouldn't happen. When the regex isn't
                    // anchored, then complete prefixes should imply complete
                    // suffixes.
                    Literal(MatchLiteralType::Unanchored)
                };
            }
        }

ここらへんに関する処理はまだちょっと読めてない。与えられた正規表現が固定文字列検索だけでマッチングできる場合は正規表現エンジンを使わないで固定文字検索手法を使うということだと思われる。

DFAが使える場合

        // If we can execute the DFA, then we totally should.
        if dfa::can_exec(&self.dfa) {
            // Regex sets require a slightly specialized path.
            if self.res.len() >= 2 {
                return DfaMany;
            }
            // If the regex is anchored at the end but not the start, then
            // just match in reverse from the end of the haystack.
            if !self.nfa.is_anchored_start && self.nfa.is_anchored_end {
                return DfaAnchoredReverse;
            }
            // If there's a longish suffix literal, then it might be faster
            // to look for that first.
            if self.should_suffix_scan() {
                return DfaSuffix;
            }
            // Fall back to your garden variety forward searching lazy DFA.
            return Dfa;
        }

ここではDFAが実行できるかを確認している。suffixが長い場合は固定文字列検索でsuffixを見つけたあと逆向きDFAでマッチングを行うようになっている。 can_execでは元になるNFAの状態数が多すぎないかとDFAでは使用できないUnicode命令が含まれていないかを確認している。

全部無理な場合

        // We're so totally hosed.
        Nfa(MatchNfaType::Auto)

上記のマッチタイプがすべて使用できない場合はNFAが使用される。regexクレートのNFAエンジンにはバックトラッキング型とPikeのVM型の2つが存在していて、基本的にはバックトラッキング型のほうが性能が良いので使用されるが、NFAの状態数が多い場合にはPikeのVM型が使用される。バックトラッキング型にはバックトラックによってマッチング時間が指数関数的に増加する問題があることが知られているが、regexクレートでは一度通った状態をキャッシュして再度通った場合にはそのスレッドのマッチングを中止するようにすることで、メモリ効率と引き換えに線形時間でのマッチングが保証されている。これによってNFAのサイズが小さい場合には高速にマッチングが行えるようだ。

まとめ

regexクレートでは複数のエンジンを駆使して効率の良いマッチングを行えるようにしているということが分かった。固定文字列関係はもう少し調べる必要がありそう。後、各正規表現エンジンの実装もちゃんと読んでいきたい。

概要