sentencepiece

Text Tokenization using Byte Pair Encoding and Unigram Modelling

About

Unsupervised text tokenizer that performs byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library, which provides a language-independent tokenizer to split text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) doi:10.18653/v1/D18-2012. Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018).

github.com/bnosac/sentencepiece
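
To make the byte pair encoding idea in the description concrete, here is a toy sketch in Python. It is an illustration only, not the package's R interface and not the SentencePiece implementation, which works on raw sentences and also supports unigram language models. The function names `learn_bpe`, `merge_pair` and `encode` and the tiny corpus are hypothetical, chosen for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Start from single characters; identical words are counted once with a frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the lexicographically smallest pair
        # (an arbitrary choice made here to keep the result deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Apply the new merge rule to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword units by applying the merge rules in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=4)
print(merges)                    # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(encode("lowest", merges))  # ['low', 'est']
```

The word "lowest" never appears in the toy corpus, yet it is split into two units seen during training, which is the property that makes subword tokenizers useful for rare and unseen words.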

Key Metrics

Version 0.2.3
R ≥ 2.10
Published 2022-11-13
Needs compilation? yes
License MPL-2.0
CRAN checks sentencepiece results

Downloads

Yesterday 5 0%
Last 7 days 71 -34%
Last 30 days 285 -4%
Last 90 days 869 -20%
Last 365 days 3,531 -14%

Maintainer

Jan Wijffels

jwijffels@bnosac.be

Authors

Jan Wijffels

aut / cre / cph

(R wrapper)

BNOSAC

cph

(R wrapper)

Google Inc.

ctb / cph

(Files at src/sentencepiece/src (Apache License, Version 2.0))

The Abseil Authors

ctb / cph

(Files at src/third_party/absl (Apache License, Version 2.0))

Google Inc.

ctb / cph

(Files at src/third_party/protobuf-lite (BSD-3 License))

Kenton Varda

ctb / cph

(Files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h, google/protobuf/stubs/common.h, google/protobuf/stubs/hash.h, google/protobuf/stubs/once.h, google/protobuf/stubs/once.h.org (BSD-3 License))

Sanjay Ghemawat

ctb / cph

(Design of files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h (BSD-3 License))

Jeff Dean

ctb / cph

(Design of files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h (BSD-3 License))

Laszlo Csomor

ctb / cph

(Files at src/third_party/protobuf-lite: io_win32.cc, google/protobuf/stubs/io_win32.h (BSD-3 License))

Wink Saville

ctb / cph

(Files at src/third_party/protobuf-lite: message_lite.cc, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h (BSD-3 License))

Jim Meehan

ctb / cph

(Files at src/third_party/protobuf-lite: structurally_valid.cc (BSD-3 License))

Chris Atenasio

ctb / cph

(Files at src/third_party/protobuf-lite: google/protobuf/wire_format_lite.h (BSD-3 License))

Jason Hsueh

ctb / cph

(Files at src/third_party/protobuf-lite: google/protobuf/io/coded_stream_inl.h (BSD-3 License))

Anton Carver

ctb / cph

(Files at src/third_party/protobuf-lite: google/protobuf/stubs/map_util.h (BSD-3 License))

Maxim Lifantsev

ctb / cph

(Files at src/third_party/protobuf-lite: google/protobuf/stubs/mathlimits.h (BSD-3 License))

Susumu Yata

ctb / cph

(Files at src/third_party/darts_clone (BSD-3 License))

Daisuke Okanohara

ctb / cph

(File src/third_party/esaxx/esa.hxx (MIT License))

Yuta Mori

ctb / cph

(File src/third_party/esaxx/sais.hxx (MIT License))

Benjamin Heinzerling

ctb / cph

(Files data/models/nl.wiki.bpe.vs1000.d25.w2v.txt, data/models/nl.wiki.bpe.vs1000.d25.w2v.bin and data/models/nl.wiki.bpe.vs1000.model (MIT License))

Material

README
NEWS
Reference manual
Package source

In Views

NaturalLanguageProcessing

macOS

r-release arm64
r-oldrel arm64
r-release x86_64
r-oldrel x86_64

Windows

r-devel x86_64
r-release x86_64
r-oldrel x86_64

Old Sources

sentencepiece archive

Depends

R ≥ 2.10

Imports

Rcpp ≥ 0.11.5
stats

Suggests

tokenizers.bpe
word2vec ≥ 0.2.0

LinkingTo

Rcpp

Reverse Suggests

textrecipes