AhoCorasickTrie

Fast Searching for Multiple Keywords in Multiple Texts

About

Aho-Corasick is an optimal algorithm for finding many keywords in a text. It can locate all matches in a text in O(N+M) time, plus the time needed to report the matches; that is, the running time scales linearly with the combined length of the keywords (N) and the size of the text (M). Compare this to the naive approach, which takes O(N*M) time to loop through each pattern and scan for it in the text.

This implementation builds the trie (the generic name of the data structure) and runs the search in a single function call. To search multiple texts with the same trie, pass a list or vector of texts; the function returns a list of matches for each text.

By default, all 128 ASCII characters are allowed in both the keywords and the text. A more efficient trie is possible if the alphabet size can be reduced. For example, DNA sequences use at most 19 distinct characters and usually only 4; protein sequences use at most 26 distinct characters and usually only 20. UTF-8 (Unicode) matching is not currently supported.
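To make the idea concrete, here is a minimal Python sketch of the general Aho-Corasick technique: build a trie with failure links once, then scan each text in a single pass. It is an illustration only, not this package's C++ code or its R interface; the names `build_automaton`, `find_matches`, and `search_texts` are hypothetical, and positions are 0-based as is usual in Python.

```python
from collections import deque


def build_automaton(keywords):
    """Build the keyword trie plus failure links (illustrative helper)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # keywords that end at each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Breadth-first pass: depth-1 states keep fail = 0; deeper states
    # inherit the longest proper suffix that is also a trie prefix.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Keywords ending at the suffix state also end here.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_matches(automaton, text):
    """Scan one text once; return (start_index, keyword) pairs."""
    goto, fail, out = automaton
    matches = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i - len(word) + 1, word))
    return matches


def search_texts(keywords, texts):
    """Build the automaton once and reuse it for every text."""
    automaton = build_automaton(keywords)
    return [find_matches(automaton, text) for text in texts]
```

For example, `search_texts(["he", "she", "his", "hers"], ["ushers"])` returns one list of matches per input text. The sketch stores transitions in dictionaries, so it accepts any characters; a fixed, smaller alphabet would allow a more compact array-based transition table, which is the kind of saving the description above refers to.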

github.com/chambm/AhoCorasickTrie
System requirements C++11
Bug report File report

Key Metrics

Version 0.1.2
Published 2020-09-29 (1277 days ago)
Needs compilation? yes
License Apache License 2.0
CRAN checks AhoCorasickTrie results

Downloads

Yesterday 11 0%
Last 7 days 52 -38%
Last 30 days 277 +7%
Last 90 days 953 +4%
Last 365 days 3,866 -23%

Maintainer

Matt Chambers

matt.chambers42@gmail.com

Authors

Matt Chambers (aut, cre)
Tomas Petricek (aut, cph)
Vanderbilt University (cph)

Material

Reference manual
Package source

macOS

r-release (arm64)
r-oldrel (arm64)
r-release (x86_64)
r-oldrel (x86_64)

Windows

r-devel (x86_64)
r-release (x86_64)
r-oldrel (x86_64)

Old Sources

AhoCorasickTrie archive

Imports

Rcpp ≥ 0.12.5

Suggests

Biostrings
microbenchmark
testthat

LinkingTo

Rcpp

Reverse Imports

customProDB
prozor