tokenizers

Fast, Consistent Tokenization of Natural Language Text

About

Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
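
A brief usage sketch of the consistent interface described above, using functions from the package's documented API (each tokenizer takes a character vector and returns a list of token vectors):

```r
library(tokenizers)

text <- "Tokenizers converts text into tokens. It has a consistent interface."

tokenize_words(text)          # word tokens
tokenize_sentences(text)      # sentence tokens
tokenize_ngrams(text, n = 2)  # shingled bigrams
count_words(text)             # word count per document
```

See the package vignettes below for the full interface, including stemming, skip n-grams, and Penn Treebank tokenization.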

Citation: tokenizers citation info
Documentation: docs.ropensci.org/tokenizers/
Source: github.com/ropensci/tokenizers
Bug reports: file a report

Key Metrics

Version 0.3.0
R ≥ 3.1.3
Published 2022-12-22
Needs compilation? yes
License MIT + file LICENSE
CRAN checks tokenizers results

Downloads

Yesterday 1,330 (+4%)
Last 7 days 7,867 (−12%)
Last 30 days 32,666 (−3%)
Last 90 days 92,267 (−2%)
Last 365 days 393,107 (−22%)

Maintainer

Lincoln Mullen

lincoln@lincolnmullen.com

Authors

Lincoln Mullen (aut / cre)
Os Keyes (ctb)
Dmitriy Selivanov (ctb)
Jeffrey Arnold (ctb)
Kenneth Benoit (ctb)

Material

README
NEWS
Reference manual
Package source

In Views

NaturalLanguageProcessing

Vignettes

Introduction to the tokenizers Package
The Text Interchange Formats and the tokenizers Package

macOS

r-release (arm64)
r-oldrel (arm64)
r-release (x86_64)
r-oldrel (x86_64)

Windows

r-devel (x86_64)
r-release (x86_64)
r-oldrel (x86_64)

Old Sources

tokenizers archive

Depends

R ≥ 3.1.3

Imports

stringi ≥ 1.0.1
Rcpp ≥ 0.12.3
SnowballC ≥ 0.5.1

Suggests

covr
knitr
rmarkdown
stopwords ≥ 0.9.0
testthat

LinkingTo

Rcpp

Reverse Imports

covfefe
deeplr
DeepPINCS
DramaAnalysis
epitweetr
pdfsearch
proustr
rslp
textfeatures
textrecipes
tidypmc
tidytext
ttgsea
wactor

Reverse Suggests

cwbtools
edgarWebR
quanteda
torchdatasets