Developer function to match regex, fixed or glob patterns against token types. This allows C++ functions to perform fast searches in a tokens object. The C++ functions use a list of type IDs to construct a hash table, against which sub-vectors of the tokens object are matched. This function constructs an index of glob patterns for faster matching.
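The hash-table lookup described above can be pictured with a small base R sketch. This is only an illustration of the idea, not the package's C++ code: the `lookup` environment, the `key()` helper and the sample `toks` and `ids` vectors are all made up for the example.

```r
# Token IDs for one document, and patterns given as type-ID vectors
# (the same shape as a list of integer vectors of matched types).
toks <- c(1L, 3L, 6L, 1L, 4L)
ids <- list(c(1L, 3L), c(1L, 4L), 6L)

# Hypothetical helper: turn a vector of IDs into a single hash key.
key <- function(x) paste(x, collapse = "_")

# An environment serves as the hash table of pattern keys.
lookup <- new.env(hash = TRUE)
for (id in ids) assign(key(id), TRUE, envir = lookup)

# Slide a window of each pattern length over the tokens and look it up.
hits <- list()
for (len in unique(lengths(ids))) {
  for (start in seq_len(max(length(toks) - len + 1L, 0L))) {
    window <- toks[start:(start + len - 1L)]
    if (exists(key(window), envir = lookup, inherits = FALSE)) {
      hits[[length(hits) + 1L]] <- c(start = start, end = start + len - 1L)
    }
  }
}

# One row per match, with the start and end positions in `toks`.
matches <- do.call(rbind, hits)
matches
```

Each sub-vector costs one hash lookup, so the search time does not grow with the number of patterns of a given length.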
pattern2fixed converts regex and glob patterns to fixed patterns.
index_types is an auxiliary function for pattern2id that constructs an index of "glob" or "fixed" patterns to avoid an expensive sequential search. For example, a type "cars" is indexed by the keys "cars", "car?", "c*", "ca*", "car*" and "cars*" when valuetype = "glob".
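As a rough illustration of how such keys could be derived, the base R sketch below reproduces the keys listed for "cars" and turns them into a key-to-ID lookup. The `glob_keys()` helper and the second type "cat" are hypothetical; the real index_types may generate further keys and store them differently.

```r
# Hypothetical helper: the glob keys named above for a single type.
glob_keys <- function(type) {
  n <- nchar(type)
  prefixes <- substring(type, 1, seq_len(n))  # "c", "ca", "car", "cars"
  c(
    type,                                     # exact key
    paste0(substr(type, 1, n - 1), "?"),      # last character replaced by "?"
    paste0(prefixes, "*")                     # every prefix followed by "*"
  )
}

glob_keys("cars")

# Build an index that maps each key to the IDs of the types it covers.
types <- c("cars", "cat")
keys <- lapply(types, glob_keys)
index <- split(rep(seq_along(types), lengths(keys)), unlist(keys))

index[["ca*"]]   # IDs of types starting with "ca"
index[["car?"]]  # IDs of four-letter types starting with "car"
```

With the index in place, a glob pattern that matches one of the keys is resolved by a single named lookup instead of a scan over all types.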
pattern2id(
pattern,
types,
valuetype = c("glob", "fixed", "regex"),
case_insensitive = TRUE,
keep_nomatch = FALSE
)
pattern2fixed(
pattern,
types,
valuetype = c("glob", "fixed", "regex"),
case_insensitive = TRUE,
keep_nomatch = FALSE
)
index_types(types, valuetype, case_insensitive, max_len = NULL)
pattern: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.
types: token types against which patterns are matched
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values
keep_nomatch: keep patterns that did not match
max_len: maximum length of types to be indexed
a list of integer vectors containing indices of matched types
pattern2fixed returns a list of character vectors containing types.
index_types returns a list of integer vectors containing type IDs with index keys as an attribute.
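The two return shapes are related by simple indexing into types. The sketch below hard-codes the IDs printed for the regex example on this page instead of calling pattern2id, so it runs in base R alone; the `ids` and `fixed` names are chosen only for the illustration.

```r
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")

# Type IDs as printed for the regex example (copied by hand, not computed here).
ids <- list(c(1L, 3L), c(1L, 4L), c(1L, 5L), 6L, 7L)

# Indexing types by each ID vector gives the fixed patterns.
fixed <- lapply(ids, function(i) types[i])
fixed[[2]]
```

The resulting list has the same contents as the pattern2fixed output shown for the same patterns.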
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
pats_regex <- list(c("^a$", "^b"), c("c"), c("d"))
pattern2id(pats_regex, types, "regex", case_insensitive = TRUE)
#> [[1]]
#> [1] 1 3
#>
#> [[2]]
#> [1] 1 4
#>
#> [[3]]
#> [1] 1 5
#>
#> [[4]]
#> [1] 6
#>
#> [[5]]
#> [1] 7
#>
pats_glob <- list(c("a*", "b*"), c("c"), c("d"))
pattern2id(pats_glob, types, "glob", case_insensitive = TRUE)
#> [[1]]
#> [1] 1 3
#>
#> [[2]]
#> [1] 2 3
#>
#> [[3]]
#> [1] 1 4
#>
#> [[4]]
#> [1] 2 4
#>
#> [[5]]
#> [1] 1 5
#>
#> [[6]]
#> [1] 2 5
#>
#> [[7]]
#> [1] 6
#>
pattern <- list(c("^a$", "^b"), c("c"), c("d"))
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
pattern2fixed(pattern, types, "regex", case_insensitive = TRUE)
#> [[1]]
#> [1] "A" "B"
#>
#> [[2]]
#> [1] "A" "BB"
#>
#> [[3]]
#> [1] "A" "BBB"
#>
#> [[4]]
#> [1] "C"
#>
#> [[5]]
#> [1] "CC"
#>
index <- index_types(c("xxx", "yyyy", "ZZZ"), "glob", FALSE, 3)
quanteda:::search_glob("yy*", attr(index, "type_search"), index)
#> [1] 2