doc2vec

Distributed Representations of Sentences, Documents and Topics

Installation

About

Learn vector representations of sentences, paragraphs or documents by using the 'Paragraph Vector' algorithms, namely the distributed bag of words ('PV-DBOW') and the distributed memory ('PV-DM') models. The techniques in the package are detailed in the paper "Distributed Representations of Sentences and Documents" by Le and Mikolov (2014).

The package also provides an implementation to cluster documents based on these embeddings using a technique called top2vec. Top2vec finds clusters in text documents by combining document and word embeddings with density-based clustering. It first embeds documents in the semantic space defined by the 'doc2vec' algorithm. Next, it maps these document embeddings to a lower-dimensional space using the 'Uniform Manifold Approximation and Projection' (UMAP) dimensionality reduction algorithm and finds dense areas in that space using a 'Hierarchical Density-Based Clustering' technique (HDBSCAN). These dense areas are the topic clusters. Each cluster can be represented by a topic vector, which is an aggregate of the embeddings of the documents that belong to that cluster. Words that lie close to the topic vector in the same semantic space are representative of the topic. More details can be found in the paper 'Top2Vec: Distributed Representations of Topics' by D. Angelov.
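The last two steps described above (aggregating document embeddings into a topic vector, then finding representative words near it) can be illustrated with a minimal Python sketch. This is not the package's R API. It assumes that the aggregate is a plain mean, that similarity is cosine similarity, and that cluster labels have already been produced by a density-based clustering step, with `-1` marking noise documents. All function names, the toy embeddings and the labels are hypothetical and chosen only for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def topic_vectors(doc_embeddings, labels):
    """Average the document embeddings of each cluster.

    Assumption: label -1 marks documents that belong to no cluster;
    those documents are skipped.
    """
    groups = {}
    for vector, label in zip(doc_embeddings, labels):
        if label == -1:
            continue
        groups.setdefault(label, []).append(vector)
    return {label: mean_vector(vecs) for label, vecs in groups.items()}


def nearest_words(topic, word_embeddings, top_n=2):
    """Return the top_n words whose embeddings are most similar to the topic vector."""
    ranked = sorted(
        word_embeddings,
        key=lambda word: cosine(topic, word_embeddings[word]),
        reverse=True,
    )
    return ranked[:top_n]


# Toy 2-dimensional embeddings (hypothetical values, not model output).
docs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8], [0.5, 0.5]]
labels = [0, 0, 1, 1, -1]  # assumed output of the clustering step
words = {
    "budget": [1.0, 0.0],
    "tax": [0.8, 0.3],
    "football": [0.0, 1.0],
    "goal": [0.2, 0.9],
}

topics = topic_vectors(docs, labels)
for label, vector in sorted(topics.items()):
    print(label, nearest_words(vector, words))
```

With these toy values, topic 0 is closest to "budget" and "tax", and topic 1 to "goal" and "football". Because words and documents share one embedding space, the same similarity measure can rank words against a topic vector built only from documents.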

github.com/bnosac/doc2vec

Key Metrics

Version 0.2.0
R ≥ 2.10
Published 2021-03-28
Needs compilation? yes
License MIT
License File
CRAN checks doc2vec results

Downloads

Yesterday: 68 (+467%)
Last 7 days: 235 (-1%)
Last 30 days: 799 (-6%)
Last 90 days: 2,372 (+10%)
Last 365 days: 8,997 (+10%)

Maintainer

Jan Wijffels

jwijffels@bnosac.be

Authors

Jan Wijffels, aut / cre / cph (R wrapper)
BNOSAC, cph (R wrapper)
hiyijian, ctb / cph (Code in src/doc2vec)

Material

README
NEWS
Reference manual
Package source

macOS

r-release: arm64, x86_64
r-oldrel: arm64, x86_64

Windows

r-devel: x86_64
r-release: x86_64
r-oldrel: x86_64

Old Sources

doc2vec archive

Depends

R ≥ 2.10

Imports

Rcpp ≥ 0.11.5
stats
utils

Suggests

tokenizers.bpe
word2vec ≥ 0.3.3
uwot
dbscan
udpipe ≥ 0.8

LinkingTo

Rcpp