しばイヌとお茶

regexクレートを歩く(固定文字検索編)

Published on: 2019-03-26

前回の記事はこちら

固定文字列検索による最適化

固定文字列とはただの文字列のことです。例えば、「Cat」や「Dog」などは固定文字列です。正規表現であれば　例えば (Apple|Banana)* というような正規表現によって「Apple」や「AppleBanana」など複数の文字列を表すことができますが、固定文字列は単一の文字列のみを表します。一般に正規表現マッチングよりも固定文字列検索のほうが効率よく行うことができます。そこで、現代の正規表現エンジンではマッチングの高速化を図るために固定文字列検索手法が取り入れられているものもあります。具体的には入力として与えられた正規表現の先頭や末尾から固定文字列を括りだし、正規表現マッチングを行う開始位置を、括りだした固定文字列を検索することによって見つけ出します。

例えば http://([^/?#]*)?([?#]*)(\?([^#]*))?(#(.*))?というような正規表現を使ってマッチングを行う場合にはhttp://の部分を固定文字列として括りだし、固定文字列検索手法を用いてマッチング対象中からhttp://の位置を見つけ出し、その位置から正規表現マッチングを行ないます。

regexクレートにおける固定文字列検索

さて本題ですが、Rustの正規表現ライブラリであるregexでも固定文字列検索手法が取り入れられてます。今回はregexクレート内で使用されている固定文字列検索手法についてまとめていこうと思います。ただ、今回はそれぞれの手法がどのようなときに使われるかを中心に見ていくので、各手法の実装の詳細については立ち入りません。

固定文字列検索についてのコードはregexクレートのsrc/literals内に配置されてあります。使用される固定文字列マッチャは以下のようになっています。

#[derive(Clone, Debug)]
enum Matcher {
    /// No literals. (Never advances through the input.)
    Empty,
    /// A set of four or more single byte literals.
    Bytes(SingleByteSet),
    /// A single substring, find using memchr and frequency analysis.
    FreqyPacked(FreqyPacked),
    /// A single substring, find using Boyer-Moore.
    BoyerMoore(BoyerMooreSearch),
    /// An Aho-Corasick automaton.
    AC(FullAcAutomaton<Literal>),
    /// A simd accelerated multiple string matcher. Used only for a small
    /// number of small literals.
    TeddySSSE3(TeddySSSE3),
    /// A simd accelerated multiple string matcher. Used only for a small
    /// number of small literals. This uses 256-bit vectors.
    TeddyAVX2(TeddyAVX2),
}

Matcherの選択はMatcher::new内で行われます。また、以後はprefixの場合について記述していますが、suffixの場合も基本的に同様です。

Empty

Emptyは固定文字列検索を行わない場合に選択されます。

        if lits.literals().is_empty() {
            return Matcher::Empty;
        }
        if sset.dense.len() >= 26 {
            // Avoid trying to match a large number of single bytes.
            // This is *very* sensitive to a frequency analysis comparison
            // between the bytes in sset and the composition of the haystack.
            // No matter the size of sset, if its members all are rare in the
            // haystack, then it'd be worth using it. How to tune this... IDK.
            // ---AG
            return Matcher::Empty;
        }

ssetはprefixの先頭1文字(suffixの場合は末尾の1文字)として現れる文字の種類数です。regexクレートでは先頭1文字目に現れる文字が26種類以上の場合は固定文字列検索を行わないようです。固定文字列検索は基本的にマッチすることのない文字列をスキップすることで効率よく文字列を走査します。しかし、パターンに含まれる文字種が多いとスキップできる文字が少なくなるので効率が悪くなります。

ただ、文字種が多くても、それらの各文字が検索対象の文字列中において十分低い出現頻度であれば効率よくスキップを行えるようです。(ソースコード内のコメントにはそこらへんのうまい調整の仕方は分からないみたいなことが書いてあるんだと思います)。

Bytes

Bytesはprefixの各パターンの最大長が1の場合に用いられます。

        if sset.complete {
            return Matcher::Bytes(sset);
        }

Bytesのマッチングにはrust-memchrが使われています。rust-memchrはRustによって実装された高速なバイト文字検索ライブラリです。内部でSIMDを使用することで高速な検索を行います。

FreqyPacked

prefixのパターンが1つだけの場合にはBoyerMooreか、FreqyPackedが使用されます。 prefixが(Dog|Cat) というようなパターンで表されている場合はprefixになりうる文字列が「Dog」もしくは「Cat」というように複数になるのこれらのマッチャは使われません。prefixが複数の場合のマッチャについては後述します。

        if lits.literals().len() == 1 {
            let lit = lits.literals()[0].to_vec();
            if BoyerMooreSearch::should_use(lit.as_slice()) {
                return Matcher::BoyerMoore(BoyerMooreSearch::new(lit));
            } else {
                return Matcher::FreqyPacked(FreqyPacked::new(lit));
            }
        }

まずはFreqyPackedについて説明します。 FreqyPackedは文字の頻度解析を利用して高速なマッチングを行います。

    fn new(pat: Vec<u8>) -> FreqyPacked {
        if pat.is_empty() {
            return FreqyPacked::empty();
        }

        // Find the rarest two bytes. Try to make them distinct (but it's not
        // required).
        let mut rare1 = pat[0];
        let mut rare2 = pat[0];
        for b in pat[1..].iter().cloned() {
            if freq_rank(b) < freq_rank(rare1) {
                rare1 = b;
            }
        }
        for &b in &pat {
            if rare1 == rare2 {
                rare2 = b
            } else if b != rare1 && freq_rank(b) < freq_rank(rare2) {
                rare2 = b;
            }
        }

        // And find the offsets of their last occurrences.
        let rare1i = pat.iter().rposition(|&b| b == rare1).unwrap();
        let rare2i = pat.iter().rposition(|&b| b == rare2).unwrap();

        let char_len = char_len_lossy(&pat);
        FreqyPacked {
            pat: pat,
            char_len: char_len,
            rare1: rare1,
            rare1i: rare1i,
            rare2: rare2,
            rare2i: rare2i,
        }

上のコード上のrare1はrare2は与えられたprefixパターンに現れる文字の中で、事前の統計結果において最も出現頻度の低い文字2つを表します。この出現頻度の低い文字を先程同様SIMDを使用して見つけ出すことでマッチングの高速化を図っています。頻度が低ければスキップできる文字が多くなるので効率が上がるという寸法です。頻度解析はscripts/frequencies.pyで行われ、解析結果はsrc/freq.rs 内に埋め込まれています。以下のコーパスが頻度解析に使われたようです。

ザ・ワールド・ファクトブック
rustcのソースコード
七十人訳聖書

ザ・ワールド・ファクトブックはCIAが毎年出している世界中の情報をまとめた本らしいです。あと、聖書が使われているってなんかすごいですね…。

日本語に対してマッチングを行う場合は日本語のコーパスを使ったりしたほうが当然マッチング効率良くなりそうな気がしますが、その辺はどうなんですかね。

BoyerMoore

BoyerMooreは基本的にはFreqyPackedよりも遅いらしいです。それだけSIMDが強力ということなのだと思います。ただ、いくつかのケースではBoyerMooreの方が速くなることもあるようです。

検索対象文字列がランダムな場合

この場合、各文字の出現頻度は同じくらいになるので頻度解析による効率化は望めません。しかし、事前に検索対象文字列がランダムであることを知ることはできないとコメントに書いてあったので、このケースについては特に対処されていないようです。

prefixのパターンが高出現頻度の文字で構成されている場合

この場合も同様に頻度解析を使った効率化は望めません。また、このケースはprefixのパターンを調べることで事前に知ることができるので最適化を施すことが容易です。

prefixのパターンが長い場合

SIMDは一度に扱えるデータの量に制限があるのでprefixが長い場合にはBoyerMooreによるスキップの効率の方が良くなってSIMDを上回ることができそうです。

これらのケースを考慮してregexクレートではprefixのパターンが長ければ長いほどBoyerMooreが使われやすくなるようになっています。BoyerMoore法を使うかどうかを判定する部分の詳しい実装はここ。

Teddy

ここからはprefixのパターンが複数ある場合に用いられる手法です。 TeddyはHyperscanで用いられている複数文字列マッチングアルゴリズムでSIMDを使って高速なマッチングを行えるようです。アルゴリズムについての詳細はsrc/literal/teddy_sse3/imp.rsに詳しく記されています。今回調べていて、この辺りの話に興味が出てきたので、このアルゴリズムについてもいずれ、また別の記事としてまとめるかもしれないです。

Aho-Corasick

Teddyを使うにはSSE3、もしくはAVX2が必要です。それらが使えない場合やprefixの先頭1文字に現れうる文字が1種類の場合にはAho-Corasickが使われます。Aho-Corasickは伝統的な複数文字列マッチング手法として知られている手法で特殊なオートマトンを用いてマッチングを行うことでパターンの数とマッチング対象テキストの長さに対して線形時間でマッチングが行なえます。

まとめ

今回はregexクレートで用いられている固定文字列検索手法についてまとめました。自分の当初の予想では学校で習ったBoyer-Moore法のようなアルゴリズムがメインで使われているのかなと思っていましたが、実際はSIMDがメインで使われていて、ハードウェアレベルの最適化強いな〜という感じでした。この論文辺りにそこらへんのことが色々書いてそうなのでこれから読んでいきたいと思います。

Comments

regexクレートを歩く

Published on: 2019-01-03

Tags: Rust 正規表現

概要

regexクレートはRustで書かれた正規表現ライブラリで高速なgrepとして有名なripgrepなどでも使用されている。今回はregexクレートを使用してRegex構造体を作成したときに内部ではどのようなことが行われているのかを見ていったのだが、色々こんがらがって来たのでここにメモを残すこととした。詳細にはあまり立ち入れていないがそれは今後の課題としたい。

以下はRust1.31で正規表現を使用する場合のサンプルコード。なおregexクレートのバージョンはv1.1.0である。

use regex::Regex;
fn main() {
    let re = Regex::new("HappyNewYear!*").unwrap();
}

このコードをエントリポイントとして、ソースコードを読んでいく。

Regexのbuild

サンプルコードで生成しているRegex構造体はregexクレートのsrc/re_unicode.rs内にて以下のように定義されている。

#[derive(Clone)]
pub struct Regex(pub Exec);

/// Core regular expression methods.
impl Regex {
    /// Compiles a regular expression. Once compiled, it can be used repeatedly
    /// to search, split or replace text in a string.
    ///
    /// If an invalid expression is given, then an error is returned.
    pub fn new(re: &str) -> Result<Regex, Error> {
        RegexBuilder::new(re).build()
    }
    /// 省略
}

Regex構造体はExec構造体をフィールドとして持つ。

pub struct Exec {
    /// All read only state.
    ro: Arc<ExecReadOnly>,
    /// Caches for the various matching engines.
    cache: CachedThreadLocal<ProgramCache>,
}

Exec構造体はフィールドとしてマッチングに必要な情報が格納されたExecReadOnly構造体と、マッチング時にキャッシュとして使用するための構造体を持っている。

ExecReadOnly構造体は以下のように定義されている。

#[derive(Debug)]
struct ExecReadOnly {
    /// The original regular expressions given by the caller to compile.
    res: Vec<String>,
    /// A compiled program that is used in the NFA simulation and backtracking.
    /// It can be byte-based or Unicode codepoint based.
    ///
    /// N.B. It is not possibly to make this byte-based from the public API.
    /// It is only used for testing byte based programs in the NFA simulations.
    nfa: Program,
    /// A compiled byte based program for DFA execution. This is only used
    /// if a DFA can be executed. (Currently, only word boundary assertions are
    /// not supported.) Note that this program contains an embedded `.*?`
    /// preceding the first capture group, unless the regex is anchored at the
    /// beginning.
    dfa: Program,
    /// The same as above, except the program is reversed (and there is no
    /// preceding `.*?`). This is used by the DFA to find the starting location
    /// of matches.
    dfa_reverse: Program,
    /// A set of suffix literals extracted from the regex.
    ///
    /// Prefix literals are stored on the `Program`, since they are used inside
    /// the matching engines.
    suffixes: LiteralSearcher,
    /// match_type encodes as much upfront knowledge about how we're going to
    /// execute a search as possible.
    match_type: MatchType,
}

ExecReadOnly構造体にはオリジナルの正規表現やnfaなどマッチング時に使用するVMコードと思しきフィールドが格納されているようである。ここではひとまず、引き続き処理を追う。

Regex::new()を呼ぶと内部でRegexBuilderに引数の正規表現が渡り、build()を呼び出している。

/// A configurable builder for a regular expression.
///
/// A builder can be used to configure how the regex is built, for example, by
/// setting the default flags (which can be overridden in the expression
/// itself) or setting various limits.
pub struct RegexBuilder(RegexOptions);

impl RegexBuilder {
    /// Create a new regular expression builder with the given pattern.
    ///
    /// If the pattern is invalid, then an error will be returned when
    /// `build` is called.
    pub fn new(pattern: &str) -> RegexBuilder {
        let mut builder = RegexBuilder(RegexOptions::default());
        builder.0.pats.push(pattern.to_owned());
        builder
    }

    /// Consume the builder and compile the regular expression.
    ///
    /// Note that calling `as_str` on the resulting `Regex` will produce the
    /// pattern given to `new` verbatim. Notably, it will not incorporate any
    /// of the flags set on this builder.
    pub fn build(&self) -> Result<Regex, Error> {
        ExecBuilder::new_options(self.0.clone())
            .only_utf8($only_utf8)
            .build()
            .map(Regex::from)
    }
    /// 省略
}

RegexBuilderはnewされるとRegexOptionsの設定を行っている。RegexOptionsでは複数行に対してマッチを行うかどうかや、任意の文字にマッチする「.」を改行にマッチさせるかなどを設定することができる。そしてbuild()を呼ぶとExecBuilderにRegexOptionsが引き渡されExec構造体が作成される。

Execのbuild

ExecBuilderのbuildメソッドは以下のようになっている。

impl ExecBuilder {
    /// Build an executor that can run a regular expression.
    pub fn build(self) -> Result<Exec, Error> {
        // Special case when we have no patterns to compile.
        // This can happen when compiling a regex set.
        if self.options.pats.is_empty() {
            let ro = Arc::new(ExecReadOnly {
                res: vec![],
                nfa: Program::new(),
                dfa: Program::new(),
                dfa_reverse: Program::new(),
                suffixes: LiteralSearcher::empty(),
                match_type: MatchType::Nothing,
            });
            return Ok(Exec {
                ro: ro,
                cache: CachedThreadLocal::new(),
            });
        }
        let parsed = self.parse()?;
        let mut nfa = Compiler::new()
            .size_limit(self.options.size_limit)
            .bytes(self.bytes || parsed.bytes)
            .only_utf8(self.only_utf8)
            .compile(&parsed.exprs)?;
        let mut dfa = Compiler::new()
            .size_limit(self.options.size_limit)
            .dfa(true)
            .only_utf8(self.only_utf8)
            .compile(&parsed.exprs)?;
        let mut dfa_reverse = Compiler::new()
            .size_limit(self.options.size_limit)
            .dfa(true)
            .only_utf8(self.only_utf8)
            .reverse(true)
            .compile(&parsed.exprs)?;

        let prefixes = parsed.prefixes.unambiguous_prefixes();
        let suffixes = parsed.suffixes.unambiguous_suffixes();
        nfa.prefixes = LiteralSearcher::prefixes(prefixes);
        dfa.prefixes = nfa.prefixes.clone();
        dfa.dfa_size_limit = self.options.dfa_size_limit;
        dfa_reverse.dfa_size_limit = self.options.dfa_size_limit;

        let mut ro = ExecReadOnly {
            res: self.options.pats,
            nfa: nfa,
            dfa: dfa,
            dfa_reverse: dfa_reverse,
            suffixes: LiteralSearcher::suffixes(suffixes),
            match_type: MatchType::Nothing,
        };
        ro.match_type = ro.choose_match_type(self.match_type);

        let ro = Arc::new(ro);
        Ok(Exec {
            ro: ro,
            cache: CachedThreadLocal::new(),
        })
    }
}

少し長いので複数に分けて上の方から順に見て行く。

正規表現パターンが与えられなかった場合の処理

        // Special case when we have no patterns to compile.
        // This can happen when compiling a regex set.
        if self.options.pats.is_empty() {
            let ro = Arc::new(ExecReadOnly {
                res: vec![],
                nfa: Program::new(),
                dfa: Program::new(),
                dfa_reverse: Program::new(),
                suffixes: LiteralSearcher::empty(),
                match_type: MatchType::Nothing,
            });
            return Ok(Exec {
                ro: ro,
                cache: CachedThreadLocal::new(),
            });
        }

ここは正規表現のパターンが与えられなかった場合の処理。 regexクレートにはRegexSetを使うことで複数の正規表現に対するマッチ結果を同時に得ることができるという機能があり、このときに正規表現が１つも与えられなかったときにこのifの内側に入るようだ。 ifの内側ではmatch_typeにどんな入力にもマッチしないことを表すMatchType::Nothingが入れられている。

正規表現のパースとコンパイル

        let parsed = self.parse()?;
        let mut nfa = Compiler::new()
            .size_limit(self.options.size_limit)
            .bytes(self.bytes || parsed.bytes)
            .only_utf8(self.only_utf8)
            .compile(&parsed.exprs)?;
        let mut dfa = Compiler::new()
            .size_limit(self.options.size_limit)
            .dfa(true)
            .only_utf8(self.only_utf8)
            .compile(&parsed.exprs)?;
        let mut dfa_reverse = Compiler::new()
            .size_limit(self.options.size_limit)
            .dfa(true)
            .only_utf8(self.only_utf8)
            .reverse(true)
            .compile(&parsed.exprs)?;

regexクレートでは正規表現をNFA, DFA, そして逆向きのDFAに事前にコンパイルして用意しているようだ。 regexクレートのマッチングは基本的にDFAで実行可能な場合はDFAで実行するようになっているが、キャプチャが必要な場合などにはnfaも使うような実装になっている。逆向きのDFAはDFAだけでマッチングを行ったときにマッチの開始位置を得るときなどに使われている。

Prefix、Suffixの取得とDFAサイズの上限の設定

        let prefixes = parsed.prefixes.unambiguous_prefixes();
        let suffixes = parsed.suffixes.unambiguous_suffixes();
        nfa.prefixes = LiteralSearcher::prefixes(prefixes);
        dfa.prefixes = nfa.prefixes.clone();
        dfa.dfa_size_limit = self.options.dfa_size_limit;
        dfa_reverse.dfa_size_limit = self.options.dfa_size_limit;

正規表現によるマッチングを高速化する手法として、正規表現の中に含まれる固定文字列をBoyer-Moore法などの高速検索手法を利用して高速化する手法がある。上記のコードにおけるprefixesとsuffixesはそれぞれ正規表現の先頭と末尾の固定文字列でありregexクレートではこれらの固定文字列を利用してマッチングの高速化を図っている。

dfa_size_limitは文字通りDFAのサイズ上限を表している。regexクレートではDFAマッチングにおいて、マッチングを行いながらDFAを構築していくonline DFAという方式を取っている。このとき一度構築された状態はキャッシュされ、再度利用される際に再構築する手間を省くようになっている。しかし、キャッシュされた状態の数がdfa_size_limitを超えた場合はキャッシュをリフレッシュするようになっている。また、キャッシュのリフレッシュが頻繁に発生する場合にはマッチングがNFAに切り替わるようになっている。

マッチタイプの選択

        let mut ro = ExecReadOnly {
            res: self.options.pats,
            nfa: nfa,
            dfa: dfa,
            dfa_reverse: dfa_reverse,
            suffixes: LiteralSearcher::suffixes(suffixes),
            match_type: MatchType::Nothing,
        };
        ro.match_type = ro.choose_match_type(self.match_type);

ここではマッチタイプの選択を行っている。マッチタイプの選択はchoose_match_typeというメソッドで行われているのでchoose_match_typeの内容を見てみる。

impl ExecReadOnly {
    fn choose_match_type(&self, hint: Option<MatchType>) -> MatchType {
        use self::MatchType::*;
        if let Some(Nfa(_)) = hint {
            return hint.unwrap();
        }
        // If the NFA is empty, then we'll never match anything.
        if self.nfa.insts.is_empty() {
            return Nothing;
        }
        // If our set of prefixes is complete, then we can use it to find
        // a match in lieu of a regex engine. This doesn't quite work well in
        // the presence of multiple regexes, so only do it when there's one.
        //
        // TODO(burntsushi): Also, don't try to match literals if the regex is
        // partially anchored. We could technically do it, but we'd need to
        // create two sets of literals: all of them and then the subset that
        // aren't anchored. We would then only search for all of them when at
        // the beginning of the input and use the subset in all other cases.
        if self.res.len() == 1 {
            if self.nfa.prefixes.complete() {
                return if self.nfa.is_anchored_start {
                    Literal(MatchLiteralType::AnchoredStart)
                } else {
                    Literal(MatchLiteralType::Unanchored)
                };
            }
            if self.suffixes.complete() {
                return if self.nfa.is_anchored_end {
                    Literal(MatchLiteralType::AnchoredEnd)
                } else {
                    // This case shouldn't happen. When the regex isn't
                    // anchored, then complete prefixes should imply complete
                    // suffixes.
                    Literal(MatchLiteralType::Unanchored)
                };
            }
        }
        // If we can execute the DFA, then we totally should.
        if dfa::can_exec(&self.dfa) {
            // Regex sets require a slightly specialized path.
            if self.res.len() >= 2 {
                return DfaMany;
            }
            // If the regex is anchored at the end but not the start, then
            // just match in reverse from the end of the haystack.
            if !self.nfa.is_anchored_start && self.nfa.is_anchored_end {
                return DfaAnchoredReverse;
            }
            // If there's a longish suffix literal, then it might be faster
            // to look for that first.
            if self.should_suffix_scan() {
                return DfaSuffix;
            }
            // Fall back to your garden variety forward searching lazy DFA.
            return Dfa;
        }
        // We're so totally hosed.
        Nfa(MatchNfaType::Auto)
    }
  }
}

これも長いので少しずつ見ていく。

マッチングタイプが指定されている場合と、NFAがからの場合

        use self::MatchType::*;
        if let Some(Nfa(_)) = hint {
            return hint.unwrap();
        }
        // If the NFA is empty, then we'll never match anything.
        if self.nfa.insts.is_empty() {
            return Nothing;
        }

hintとしてmatch_typeの指定(NFAのみ)があればそれを返し、nfaがからの場合はNothingを返すようになっている。

固定文字列検索だけでマッチングが行える場合?

        // If our set of prefixes is complete, then we can use it to find
        // a match in lieu of a regex engine. This doesn't quite work well in
        // the presence of multiple regexes, so only do it when there's one.
        //
        // TODO(burntsushi): Also, don't try to match literals if the regex is
        // partially anchored. We could technically do it, but we'd need to
        // create two sets of literals: all of them and then the subset that
        // aren't anchored. We would then only search for all of them when at
        // the beginning of the input and use the subset in all other cases.
        if self.res.len() == 1 {
            if self.nfa.prefixes.complete() {
                return if self.nfa.is_anchored_start {
                    Literal(MatchLiteralType::AnchoredStart)
                } else {
                    Literal(MatchLiteralType::Unanchored)
                };
            }
            if self.suffixes.complete() {
                return if self.nfa.is_anchored_end {
                    Literal(MatchLiteralType::AnchoredEnd)
                } else {
                    // This case shouldn't happen. When the regex isn't
                    // anchored, then complete prefixes should imply complete
                    // suffixes.
                    Literal(MatchLiteralType::Unanchored)
                };
            }
        }

ここらへんに関する処理はまだちょっと読めてない。与えられた正規表現が固定文字列検索だけでマッチングできる場合は正規表現エンジンを使わないで固定文字検索手法を使うということだと思われる。

DFAが使える場合

        // If we can execute the DFA, then we totally should.
        if dfa::can_exec(&self.dfa) {
            // Regex sets require a slightly specialized path.
            if self.res.len() >= 2 {
                return DfaMany;
            }
            // If the regex is anchored at the end but not the start, then
            // just match in reverse from the end of the haystack.
            if !self.nfa.is_anchored_start && self.nfa.is_anchored_end {
                return DfaAnchoredReverse;
            }
            // If there's a longish suffix literal, then it might be faster
            // to look for that first.
            if self.should_suffix_scan() {
                return DfaSuffix;
            }
            // Fall back to your garden variety forward searching lazy DFA.
            return Dfa;
        }

ここではDFAが実行できるかを確認している。suffixが長い場合は固定文字列検索でsuffixを見つけたあと逆向きDFAでマッチングを行うようになっている。 can_execでは元になるNFAの状態数が多すぎないかとDFAでは使用できないUnicode命令が含まれていないかを確認している。

全部無理な場合

        // We're so totally hosed.
        Nfa(MatchNfaType::Auto)

上記のマッチタイプがすべて使用できない場合はNFAが使用される。regexクレートのNFAエンジンにはバックトラッキング型とPikeのVM型の2つが存在していて、基本的にはバックトラッキング型のほうが性能が良いので使用されるが、NFAの状態数が多い場合にはPikeのVM型が使用される。バックトラッキング型にはバックトラックによってマッチング時間が指数関数的に増加する問題があることが知られているが、regexクレートでは一度通った状態をキャッシュして再度通った場合にはそのスレッドのマッチングを中止するようにすることで、メモリ効率と引き換えに線形時間でのマッチングが保証されている。これによってNFAのサイズが小さい場合には高速にマッチングが行えるようだ。

まとめ

regexクレートでは複数のエンジンを駆使して効率の良いマッチングを行えるようにしているということが分かった。固定文字列関係はもう少し調べる必要がありそう。後、各正規表現エンジンの実装もちゃんと読んでいきたい。

Comments

INVERSUScon 参加記

Published on: 2018-10-18

Tags: INVERSUS INVERSUScon

2018年10月13日に東京は中野にあるRed Bull Gaming Sphere TokyoにてINVERSUSのイベント「INVERSUScon」が開催されました。

INVERSUSとは自分の弾を発射することで敵の陣地を自分の陣地に塗り替えたりしながら戦う2Dシューティングゲームです。様々なプラットフォームで発売されているのでとりあえず買いましょう。

イベントに参加するまで

INVERSUS関連のイベントは以前Gongonさんに誘われたりしていたのですが、東京は自分の居住地から遠く離れていることもあって中々厳しいなという状況でした。しかし、そんなある日INVERSUS開発者のアカウントがこんなツイートをしました。

Trip to Japan in October? Yes! I’ll have more details on what day we will have an INVERSUS community event soon (in Tokyo)! 🇯🇵
— INVERSUS (@InversusGame) July 15, 2018

なんとINVERSUS開発者が来日してイベントに参加するとのこと。これは行くしかないと思いました。そして、しばらくしてINVERSUS JPからもイベントの正式な発表があり参加する運びとなりました。

イベント当日

イベント当日なんとか中野駅までたどり着き、駅からGoo◯leマップの案内に従って会場を目指したら、会場の建物の裏側に案内されたようですがしばらく気づかず迷ってました。

ナビが表からではなく裏側をナビしてたから場所がわからんかったぞ
— カルゴウ (@karugou) October 13, 2018

同じ目に遭った方もいたようです。アプリに頼りきってはいけない…

裏側に案内された可能性に気づいて、反対方向に回ると~~叫び声~~盛り上がっている声が聞こえてきたのですぐにこちらが正しい場所なのだとわかりました。

【飛び入り歓迎🏁】#INVERSUScon は本日15時開場です。
お気軽に中野の #RedBullGamingSphereTokyo へお越しください📣

詳細はこちら>>https://t.co/j5AtDRXoGx #INVERSUS pic.twitter.com/M5NcUZwy4A
— INVERSUS JP(日本コミュニティ) (@inversusjp) October 13, 2018

会場はこんな感じでした。かっこいい（恍惚）

会場に入ってまず最初に開発者のRyan Jukketさんと対戦し、少しお話することができました。Twitterアイコン可愛いねと言ってくれました。柴犬の可愛さに国境はない。

たのしいINVERSUS講座

INVERSUSトッププレイヤーのGongonさんによるINVERSUS講座が開講されました。トーナメント前に講座があるなんて手厚い。内容はINVERSUSのアイテムの説明など基本的なことから実践的なテクニックまでかなりボリュームのあるものとなっていました。経験的に知っていることもしっかりと整理された状態で見ることでより理解が深まったように思われました。

#INVERSUScon 世界一位、Gongonさんによるインバーサス講座 @uo6uo6 pic.twitter.com/hHrYpGhEFk
— INVERSUS JP(日本コミュニティ) (@inversusjp) October 13, 2018

講義で使用されたスライドは後日どこかに公開されるようなので気になる方はGongonさんのTwitterをチェックしてみると良いかもしれません。

トーナメント大会

今回のもう一つのメインイベントであるトーナメント大会も開催されました。優勝者と準優勝者には非売品グッズが贈呈されると発表されていたのもあって、やる気満々でした。

トーナメント表 ( より引用)

しかし、私は準決勝で負けてしまいました。大会は準決勝で私に勝ったカルゴウさんがそのまま優勝しました。最後勝てそうだっただけに悔しさが残ります。次があれば次こそは勝ちたい。

懇親会

ピザが出ました。美味しかったです。あとは他の参加者とおしゃべりしたりINVERSUSしたり、INVERSUSしたり…時間があっという間に過ぎていきました。

まとめ

この手のイベントに参加するのは初めてでしたがすごく楽しめました。オンラインでゲームするのもいいですが、オフラインでやると盛り上がりがすごくてとても臨場感がありすごかったです（語彙力）

イベントを主催してくださった関係者の皆様ありがとうございました！

Comments

CTF4b 2018 writeup

Published on: 2018-05-27

Tags: ctf writeup

約1ヶ月前に新しく研究室に加わった後輩達の勧めでCTFを始めました。その後彼らの管理しているCTFサイトの問題を解いたりしていたのですが、今回初心者向けのCTFとしてSECCON Beginners CTFが開かれるとのことで後輩達とチームを組んで出場する運びとなりました。

いわく、「解いた問題のwriteupを公開するまでがCTF」らしいのでここにwriteupを記します。

[Warmup] plain mail

与えられたパケットを調べるとパスワード付きのzipファイルとパスワードが見つかるのでzipファイルを解凍するとフラグが得られる。

ctf4b{email_with_encrypted_file}

RSA is Power

与えられた$N$を素因数分解して、$N=p\times q$ となるような $p, q$ を求めます。

p = 299681192390656691733849646142066664329
q = 324144336644773773047359441106332937713

あとは $de \equiv 1\bmod (p-1)(q-1)$ を満す $d$ を拡張ユークリッド互除法を用いて求めてやれば、 Wikipedaに載っているとおりに $C^{d} \bmod N$ とすれば復号できます。

d = 88509020092584531671107468782943602124921999287671161687233461555074737950465
C^d mod N =  0x63746634627b35696d706c655f7273345f31735f336173795f6630725f757d

得られるフラグは16進数表記なので適当に文字列に変換します。

ctf4b{5imple_rs4_1s_3asy_f0r_u}

Streaming

フラグの先頭がctf4bであることが分かっているのでそれを使って初期seedを求め、そこから次々にseedを求めながら複合していけばフラグを得られます。

こんな感じのコードを書きました。

def next(seed):
    A = 37423
    B = 61781
    C = 34607
    return (A*seed+B)%C

f = open('encrypted', 'rb')
encrypted = f.read()
decrypted = ""
d = int(hex(ord(encrypted[0])), 16) + int(hex(ord(encrypted[0+1])), 16) * 256
#init seed
seed = d ^ 25460
decrypted += chr(int(hex(d^seed)[2:4], 16)) + chr(int(hex(d^seed)[4:6], 16))
print(hex(d^seed))
for i in range(1, len(encrypted)//2):
    d = int(hex(ord(encrypted[2*i])), 16) + int(hex(ord(encrypted[2*i+1])), 16) * 256
    seed = next(seed)
    d = d^seed
    print(hex(d))
    if len(hex(d)) <= 4:
        decrypted += chr(int(hex(d)[2:4], 16))
    else:
        decrypted += chr(int(hex(d)[2:4], 16)) + chr(int(hex(d)[4:6], 16))
print(decrypted)

ctf4b{lcg-is-easily-predictable}

てけいさんえくすとりーむず

サーバにアクセスすると四則演算を解かされます。自分はえくすとりーむなてけいさんのプロではないのでプログラムにやらせました。

import socket
import re

class Netcat:
    """ Python 'netcat like' module """
    def __init__(self, ip, port):
        self.buff = ""
        self.socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.socket.connect((ip, port))

    def read(self, length = 1024):
        """ Read 1024 bytes off the socket """
        return self.socket.recv(length)

    def read_until(self, data):
        """ Read data into the buffer until we have data """
        while not data in self.buff:
            self.buff += self.socket.recv(1024)
        pos = self.buff.find(data)
        rval = self.buff[:pos + len(data)]
        self.buff = self.buff[pos + len(data):]
        return rval

    def write(self, data):
        self.socket.send(data)
        
    def close(self):
        self.socket.close()

nc = Netcat('133.242.234.140', 8690)
nc.read_until('Stage.1)')
for i in range(1, 101):
    print(nc.read_until('Stage.'+str(i)+')'))
    expr = nc.read_until('=')
    expr = expr.strip()
    m = re.match(r'(\d+)\s*(\+|-|\*|/)\s*(\d+).*', expr)
    term1 = int(m.group(1))
    op = m.group(2)
    term2 = int(m.group(3))
    ans = ""
    if op == "+":
        ans = term1 + term2
    elif op == "-":
        ans = term1 - term2
    elif op == "*":
        ans = term1 * term2
    elif op == "/":
        ans = term1 / term2
    print(expr), 
    print(ans)
    nc.write(str(ans) + '\n')
print(nc.read())

Netcatクラスの部分のコードはここのやつをそのまんま使ってます。

これを実行すると

(Stage.1)
908 * 509 = 462172
 (Stage.2)
925 * 757 = 700225
 (Stage.3)
940 - 850 = 90
 (Stage.4)
618 + 921 = 1539
 (Stage.5)
625 - 536 = 89
    .
    .
    .
 (Stage.100)
860 + 823 = 1683
Congrats.
Flag is: "ctf4b{ekusutori-mu>tekeisann>bigina-zu>2018}"

ctf4b{ekusutori-mu>tekeisann>bigina-zu>2018}

感想

初心者向けということで簡単な問題だけかと思ってたんですが、結構難しい問題もあったなという印象でした。でも、Cryptoは結構解けたので良かったです。Pwn, Reversingはｾﾞﾝｾﾞﾝﾜｶﾗﾝｶｯﾀ。

Comments