Package: textTinyR 1.1.8

textTinyR: Text Processing for Small or Big Data Files

It offers functions for splitting, parsing and tokenizing big text data files and for creating a vocabulary from them. It also includes functions for building a document-term matrix and extracting information from it (term associations, most frequent terms), for calculating token statistics (collocations, look-up tables, string dissimilarities), and for working with sparse matrices. Lastly, it includes functions for word vector representations (i.e. 'GloVe', 'fasttext') and for calculating (pairwise) text document dissimilarities. The source code is written in 'C++11' and exported to R through the 'Rcpp', 'RcppArmadillo' and 'BH' packages.

Authors: Lampros Mouselimis [aut, cre]

# Install 'textTinyR' in R:

install.packages('textTinyR', repos = c('https://mlampros.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/mlampros/texttinyr/issues

Uses libs:

openblas – Optimized BLAS
c++ – GNU Standard C++ Library v3
openmp – GCC OpenMP (GOMP) support library

On CRAN:

Topics: bh, boost, cpp11, processing, rcpp, rcpparmadillo, text, openblas, cpp, openmp

7.64 score, 39 stars, 1 package, 244 scripts, 1.5k downloads, 30 exports, 7 dependencies

Last updated 1 year ago from: 11c7a8e669. Checks: 12 OK. Indexed: yes.

Target	Result	Latest binary
Doc / Vignettes	OK	Mar 21 2025
R-4.5-win-x86_64	OK	Mar 21 2025
R-4.5-mac-x86_64	OK	Mar 21 2025
R-4.5-mac-aarch64	OK	Mar 21 2025
R-4.5-linux-x86_64	OK	Mar 21 2025
R-4.4-win-x86_64	OK	Mar 21 2025
R-4.4-mac-x86_64	OK	Mar 21 2025
R-4.4-mac-aarch64	OK	Mar 21 2025
R-4.4-linux-x86_64	OK	Mar 21 2025
R-4.3-win-x86_64	OK	Mar 21 2025
R-4.3-mac-x86_64	OK	Mar 21 2025
R-4.3-mac-aarch64	OK	Mar 21 2025

Exports: batch_compute big_tokenize_transform bytes_converter cluster_frequency COS_TEXT cosine_distance Count_Rows dense_2sparse dice_distance dims_of_word_vecs Doc2Vec JACCARD_DICE levenshtein_distance load_sparse_binary matrix_sparsity read_characters read_rows save_sparse_binary select_predictors sparse_Means sparse_Sums sparse_term_matrix TEXT_DOC_DISSIM text_file_parser text_intersect token_stats tokenize_transform_text tokenize_transform_vec_docs utf_locale vocabulary_parser

Dependencies: BH data.table lattice Matrix R6 Rcpp RcppArmadillo

Functionality of the textTinyR package

Lampros Mouselimis

Rendered from functionality_of_textTinyR_package.Rmd using knitr::rmarkdown on Mar 21 2025.

Last update: 2021-10-29
Started: 2017-01-04

Word vectors - doc2vec - text clustering

Lampros Mouselimis

Rendered from word_vectors_doc2vec.Rmd using knitr::rmarkdown on Mar 21 2025.

Last update: 2021-10-29
Started: 2018-04-03

Help page	Topics
Compute batches	batch_compute
String tokenization and transformation for big data sets	big_tokenize_transform
bytes converter of a text file ( KB, MB or GB )	bytes_converter
Frequencies of an existing cluster object	cluster_frequency
Cosine similarity for text documents	COS_TEXT
cosine distance of two character strings (each string consists of more than one word)	cosine_distance
Number of rows of a file	Count_Rows
convert a dense matrix to a sparse matrix	dense_2sparse
dice similarity of words using n-grams	dice_distance
dimensions of a word vectors file	dims_of_word_vecs
Conversion of text documents to word-vector-representation features ( Doc2Vec )	Doc2Vec
Jaccard or Dice similarity for text documents	JACCARD_DICE
levenshtein distance of two words	levenshtein_distance
load a sparse matrix in binary format	load_sparse_binary
sparsity percentage of a sparse matrix	matrix_sparsity
read a specific number of characters from a text file	read_characters
read a specific number of rows from a text file	read_rows
save a sparse matrix in binary format	save_sparse_binary
Exclude highly correlated predictors	select_predictors
RowMeans and colMeans for a sparse matrix	sparse_Means
RowSums and colSums for a sparse matrix	sparse_Sums
Term matrices and statistics ( document-term-matrix, term-document-matrix)	sparse_term_matrix
Dissimilarity calculation of text documents	TEXT_DOC_DISSIM
text file parser	text_file_parser
intersection of words or letters in tokenized text	text_intersect
token statistics	token_stats
String tokenization and transformation ( character string or path to a file )	tokenize_transform_text
String tokenization and transformation ( vector of documents )	tokenize_transform_vec_docs
utf-locale for the available languages	utf_locale
returns the vocabulary counts for small or medium ( xml and not only ) files	vocabulary_parser
