Package 'textTinyR'

Title: Text Processing for Small or Big Data Files
Description: It offers functions for splitting, parsing, tokenizing and creating a vocabulary for big text data files. Moreover, it includes functions for building a document-term matrix and extracting information from it (term-associations, most frequent terms). It also provides functions for calculating token statistics (collocations, look-up tables, string dissimilarities) and functions to work with sparse matrices. Lastly, it includes functions for Word Vector Representations (i.e. 'GloVe', 'fasttext') and incorporates functions for the calculation of (pairwise) text document dissimilarities. The source code is based on 'C++11' and exposed to R through the 'Rcpp', 'RcppArmadillo' and 'BH' packages.
Authors: Lampros Mouselimis [aut, cre]
Maintainer: Lampros Mouselimis <[email protected]>
License: GPL-3
Version: 1.1.8
Built: 2024-11-21 05:37:07 UTC
Source: https://github.com/mlampros/texttinyr

Help Index


Compute batches

Description

Compute batches

Usage

batch_compute(n_rows, n_batches)

Arguments

n_rows

a numeric specifying the number of rows

n_batches

a numeric specifying the number of output batches

Value

a list

Examples

library(textTinyR)

btch = batch_compute(n_rows = 1000, n_batches = 10)

String tokenization and transformation for big data sets

Description

String tokenization and transformation for big data sets


Usage

# utl <- big_tokenize_transform$new(verbose = FALSE)

Details

the big_text_splitter function splits a text file into sub-text-files using either the batches parameter (big-text-splitter-bytes) or both the batches and the end_query parameter (big-text-splitter-query). The end_query parameter (if not NULL) should be a character string specifying a word that appears repeatedly at the end of each line in the text file.

the big_text_parser function parses text files from an input folder and saves those processed files to an output folder. The big_text_parser is appropriate for files with a structure that allows subsets to be extracted using the start_query and end_query parameters.

the big_text_tokenizer function tokenizes and transforms the text files of a folder and saves those files to either a folder or a single file. There is also the option to save a frequency vocabulary of those transformed tokens to a file.

the vocabulary_accumulator function takes the resulting vocabulary files of the big_text_tokenizer and returns the vocabulary sums sorted in decreasing order. The max_num_chars parameter limits the words of the output vocabulary to a maximum number of characters.

The ngram_sequential or ngram_overlap stemming method applies to each single batch and not to the whole corpus of the text file. Thus, it is possible that the stems of the same words for randomly selected batches might differ.

Methods

big_tokenize_transform$new(verbose = FALSE)
--------------
big_text_splitter(input_path_file = NULL, output_path_folder = NULL, end_query = NULL, batches = NULL, trimmed_line = FALSE)
--------------
big_text_parser(input_path_folder = NULL, output_path_folder = NULL, start_query = NULL, end_query = NULL, min_lines = 1, trimmed_line = FALSE)
--------------
big_text_tokenizer(input_path_folder = NULL, batches = NULL, read_file_delimiter = "\n", to_lower = FALSE, to_upper = FALSE, utf_locale = "", remove_char = "", remove_punctuation_string = FALSE, remove_punctuation_vector = FALSE, remove_numbers = FALSE, trim_token = FALSE, split_string = FALSE, split_separator = " \r\n\t.,;:()?!//", remove_stopwords = FALSE, language = "english", min_num_char = 1, max_num_char = Inf, stemmer = NULL, min_n_gram = 1, max_n_gram = 1, skip_n_gram = 1, skip_distance = 0, n_gram_delimiter = " ", concat_delimiter = NULL, path_2folder = "", stemmer_ngram = 4, stemmer_gamma = 0.0, stemmer_truncate = 3, stemmer_batches = 1, threads = 1, save_2single_file = FALSE, increment_batch_nr = 1, vocabulary_path_folder = NULL)
--------------
vocabulary_accumulator(input_path_folder = NULL, vocabulary_path_file = NULL, max_num_chars = 100)

Methods

Public methods


Method new()

Usage
big_tokenize_transform$new(verbose = FALSE)
Arguments
verbose

either TRUE or FALSE. If TRUE then information will be printed in the console


Method big_text_splitter()

Usage
big_tokenize_transform$big_text_splitter(
  input_path_file = NULL,
  output_path_folder = NULL,
  end_query = NULL,
  batches = NULL,
  trimmed_line = FALSE
)
Arguments
input_path_file

a character string specifying the path to the input file

output_path_folder

a character string specifying the folder where the output files should be saved

end_query

a character string. The end_query is the last word of the subset of the data and should appear frequently at the end of each line in the text file.

batches

a numeric value specifying the number of batches to use. The batches will be used to split the initial data into subsets. Those subsets will be either saved in files (big_text_splitter function) or will be used internally for low memory processing (big_text_tokenizer function).

trimmed_line

either TRUE or FALSE. If FALSE then each line of the text file will be trimmed on both sides before applying the start_query and end_query


Method big_text_parser()

Usage
big_tokenize_transform$big_text_parser(
  input_path_folder = NULL,
  output_path_folder = NULL,
  start_query = NULL,
  end_query = NULL,
  min_lines = 1,
  trimmed_line = FALSE
)
Arguments
input_path_folder

a character string specifying the folder where the input files are saved

output_path_folder

a character string specifying the folder where the output files should be saved

start_query

a character string. The start_query is the first word of the subset of the data and should appear frequently at the beginning of each line in the text file.

end_query

a character string. The end_query is the last word of the subset of the data and should appear frequently at the end of each line in the text file.

min_lines

a numeric value specifying the minimum number of lines. For instance if min_lines = 2, then only subsets of text with more than 1 line will be kept.

trimmed_line

either TRUE or FALSE. If FALSE then each line of the text file will be trimmed on both sides before applying the start_query and end_query


Method big_text_tokenizer()

Usage
big_tokenize_transform$big_text_tokenizer(
  input_path_folder = NULL,
  batches = NULL,
  read_file_delimiter = "\n",
  to_lower = FALSE,
  to_upper = FALSE,
  utf_locale = "",
  remove_char = "",
  remove_punctuation_string = FALSE,
  remove_punctuation_vector = FALSE,
  remove_numbers = FALSE,
  trim_token = FALSE,
  split_string = FALSE,
  split_separator = " \r\n\t.,;:()?!//",
  remove_stopwords = FALSE,
  language = "english",
  min_num_char = 1,
  max_num_char = Inf,
  stemmer = NULL,
  min_n_gram = 1,
  max_n_gram = 1,
  skip_n_gram = 1,
  skip_distance = 0,
  n_gram_delimiter = " ",
  concat_delimiter = NULL,
  path_2folder = "",
  stemmer_ngram = 4,
  stemmer_gamma = 0,
  stemmer_truncate = 3,
  stemmer_batches = 1,
  threads = 1,
  save_2single_file = FALSE,
  increment_batch_nr = 1,
  vocabulary_path_folder = NULL
)
Arguments
input_path_folder

a character string specifying the folder where the input files are saved

batches

a numeric value specifying the number of batches to use. The batches will be used to split the initial data into subsets. Those subsets will be either saved in files (big_text_splitter function) or will be used internally for low memory processing (big_text_tokenizer function).

read_file_delimiter

the delimiter to use when the input file is read (for instance a tab-delimiter or a new-line delimiter).

to_lower

either TRUE or FALSE. If TRUE the character string will be converted to lower case

to_upper

either TRUE or FALSE. If TRUE the character string will be converted to upper case

utf_locale

the language specific locale to use in case that either the to_lower or the to_upper parameter is TRUE and the text file language is other than english. For instance if the language of a text file is greek then the utf_locale parameter should be 'el_GR.UTF-8' ( language_country.encoding ). A wrong utf-locale does not raise an error, however the runtime of the function increases.

remove_char

a character string with specific characters that should be removed from the text file. If the remove_char is "" then no removal of characters takes place

remove_punctuation_string

either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function)

remove_punctuation_vector

either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place)

remove_numbers

either TRUE or FALSE. If TRUE then any numbers in the character string will be removed

trim_token

either TRUE or FALSE. If TRUE then the string will be trimmed (left and/or right)

split_string

either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters.

split_separator

a character string specifying the character delimiter(s)

remove_stopwords

either TRUE, FALSE or a character vector of user defined stop words. If TRUE then by using the language parameter the corresponding stop words vector will be loaded.

language

a character string which defaults to english. If the remove_stopwords parameter is TRUE then the corresponding stop words vector will be loaded. Available languages are afrikaans, arabic, armenian, basque, bengali, breton, bulgarian, catalan, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hausa, hebrew, hindi, hungarian, indonesian, irish, italian, latvian, marathi, norwegian, persian, polish, portuguese, romanian, russian, slovak, slovenian, somalia, spanish, swahili, swedish, turkish, yoruba, zulu

min_num_char

an integer specifying the minimum number of characters to keep. If min_num_char is greater than 1 then only character strings with at least that many characters will be returned

max_num_char

an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this function the Inf value translates to a word-length of 1000000000)

stemmer

a character string specifying the stemming method. One of the following: porter2_stemmer, ngram_sequential, ngram_overlap. See details for more information.

min_n_gram

an integer specifying the minimum number of n-grams. The minimum number of min_n_gram is 1.

max_n_gram

an integer specifying the maximum number of n-grams. The minimum number of max_n_gram is 1.

skip_n_gram

an integer specifying the number of skip-n-grams. The minimum number of skip_n_gram is 1. The skip_n_gram gives the (max.) n-grams using the skip_distance parameter. If skip_n_gram is greater than 1 then both min_n_gram and max_n_gram should be set to 1.

skip_distance

an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned.

n_gram_delimiter

a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases)

concat_delimiter

either NULL or a character string specifying the delimiter to use in order to concatenate the end-vector of character strings to a single character string (recommended in case that the end-vector should be saved to a file)

path_2folder

a character string specifying the path to the folder where the file(s) will be saved

stemmer_ngram

a numeric value greater than 1. Applies to both ngram_sequential and ngram_overlap methods. In case of ngram_sequential the first stemmer_ngram characters will be picked, whereas in the case of ngram_overlap the overlapping stemmer_ngram characters will be built.

stemmer_gamma

a float number greater than or equal to 0.0. Applies only to ngram_sequential. It is a threshold value which defines how much frequency deviation between two N-grams is acceptable. It should be kept either at zero or at a minimum value.

stemmer_truncate

a numeric value greater than 0. Applies only to ngram_sequential. The ngram_sequential method is modified to use relative frequencies (float numbers between 0.0 and 1.0 for the n-grams of a specific word in the corpus) and the stemmer_truncate parameter controls the number of rounding digits for the n-grams of the word. The main purpose is to give the same relative frequency to words appearing approximately the same number of times in the corpus.

stemmer_batches

a numeric value greater than 0. Applies only to ngram_sequential. Splits the corpus into batches with the option to run the batches in multiple threads.

threads

an integer specifying the number of cores to run in parallel

save_2single_file

either TRUE or FALSE. If TRUE then the output data will be saved in a single file. Otherwise the data will be saved in multiple files with incremented enumeration

increment_batch_nr

a numeric value. The enumeration of the output files will start from the increment_batch_nr. If the save_2single_file parameter is TRUE then the increment_batch_nr parameter won't be taken into consideration.

vocabulary_path_folder

either NULL or a character string specifying the output folder where the vocabulary batches should be saved (after tokenization and transformation is applied). Applies to the big_text_tokenizer method.


Method vocabulary_accumulator()

Usage
big_tokenize_transform$vocabulary_accumulator(
  input_path_folder = NULL,
  vocabulary_path_file = NULL,
  max_num_chars = 100
)
Arguments
input_path_folder

a character string specifying the folder where the input files are saved

vocabulary_path_file

either NULL or a character string specifying the output file where the vocabulary should be saved (after tokenization and transformation is applied). Applies to the vocabulary_accumulator method.

max_num_chars

a numeric value to limit the words of the output vocabulary to a maximum number of characters (applies to the vocabulary_accumulator function)


Method clone()

The objects of this class are cloneable with this method.

Usage
big_tokenize_transform$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

## Not run: 

library(textTinyR)

fs <- big_tokenize_transform$new(verbose = FALSE)

#---------------
# file splitter:
#---------------

fs$big_text_splitter(input_path_file = "input.txt",
                     output_path_folder = "/folder/output/",
                     end_query = "endword", batches = 5,
                     trimmed_line = FALSE)


#-------------
# file parser:
#-------------

fs$big_text_parser(input_path_folder = "/folder/output/",
                    output_path_folder = "/folder/parser/",
                    start_query = "startword", end_query = "endword",
                    min_lines = 1, trimmed_line = TRUE)


#----------------
# file tokenizer:
#----------------


 fs$big_text_tokenizer(input_path_folder = "/folder/parser/",
                       batches = 5, split_string=TRUE,
                       to_lower = TRUE, trim_token = TRUE,
                       max_num_char = 100, remove_stopwords = TRUE,
                       stemmer = "porter2_stemmer", threads = 1,
                       path_2folder="/folder/output_token/",
                       vocabulary_path_folder="/folder/VOCAB/")

#-------------------
# vocabulary counts:
#-------------------


fs$vocabulary_accumulator(input_path_folder = "/folder/VOCAB/",
                           vocabulary_path_file = "/folder/vocab.txt",
                           max_num_chars = 50)


## End(Not run)

bytes converter of a text file ( KB, MB or GB )

Description

bytes converter of a text file ( KB, MB or GB )

Usage

bytes_converter(input_path_file = NULL, unit = "MB")

Arguments

input_path_file

a character string specifying the path to the input file

unit

a character string specifying the unit. One of KB, MB, GB

Value

a number

Examples

## Not run: 

library(textTinyR)

bc = bytes_converter(input_path_file = 'some_file.txt', unit = "MB")


## End(Not run)

Frequencies of an existing cluster object

Description

Frequencies of an existing cluster object

Usage

cluster_frequency(tokenized_list_text, cluster_vector, verbose = FALSE)

Arguments

tokenized_list_text

a list of tokenized text documents. This can be the result of the textTinyR::tokenize_transform_vec_docs function with the as_token parameter set to TRUE (the token object of the output)

cluster_vector

a numeric vector. This can be the result of the ClusterR::KMeans_rcpp function (the clusters object of the output)

verbose

either TRUE or FALSE. If TRUE then information will be printed out in the R session.

Details

This function takes a list of tokenized text and a numeric vector of clusters and returns the sorted frequency of each cluster. The length of the tokenized_list_text object must be equal to the length of the cluster_vector object

Value

a list of data.tables

Examples

library(textTinyR)

tok_lst = list(c('the', 'the', 'tokens', 'of', 'first', 'document'),
               c('the', 'tokens', 'of', 'of', 'second', 'document'),
               c('the', 'tokens', 'of', 'third', 'third', 'document'))

vec_clust = rep(1:6, 3)

res = cluster_frequency(tok_lst, vec_clust)

Cosine similarity for text documents

Description

Cosine similarity for text documents

Usage

COS_TEXT(
  text_vector1 = NULL,
  text_vector2 = NULL,
  threads = 1,
  separator = " "
)

Arguments

text_vector1

a character string vector representing text documents (it should have the same length as the text_vector2)

text_vector2

a character string vector representing text documents (it should have the same length as the text_vector1)

threads

a numeric value specifying the number of cores to run in parallel

separator

specifies the separator used between words of each character string in the text vectors

Details

The function calculates the cosine distance between pairs of text sequences of two character string vectors

Value

a numeric vector

Examples

library(textTinyR)

vec1 = c('use this', 'function to compute the')

vec2 = c('cosine distance', 'between text sequences')

out = COS_TEXT(text_vector1 = vec1, text_vector2 = vec2, separator = " ")
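
# 'out' is a numeric vector holding one value per pair of text sequences (here of length 2)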

cosine distance of two character strings (each string consists of more than one word)

Description

cosine distance of two character strings (each string consists of more than one word)

Usage

cosine_distance(sentence1, sentence2, split_separator = " ")

Arguments

sentence1

a character string consisting of multiple words

sentence2

a character string consisting of multiple words

split_separator

a character string specifying the delimiter(s) to split the sentence

Value

a float number

Examples

library(textTinyR)

sentence1 = 'this is one sentence'

sentence2 = 'this is a similar sentence'

cds = cosine_distance(sentence1, sentence2)

Number of rows of a file

Description

Number of rows of a file

Usage

Count_Rows(PATH, verbose = FALSE)

Arguments

PATH

a character string specifying the path to a file

verbose

either TRUE or FALSE

Details

This function returns the number of rows of a file. It doesn't load the data into memory.

Value

a numeric value

Examples

library(textTinyR)

PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")

num_rows = Count_Rows(PATH)

convert a dense matrix to a sparse matrix

Description

convert a dense matrix to a sparse matrix

Usage

dense_2sparse(dense_mat)

Arguments

dense_mat

a dense matrix

Value

a sparse matrix

Examples

library(textTinyR)

tmp = matrix(sample(0:1, 100, replace = TRUE), 10, 10)

sp_mat = dense_2sparse(tmp)

dice similarity of words using n-grams

Description

dice similarity of words using n-grams

Usage

dice_distance(word1, word2, n_grams = 2)

Arguments

word1

a character string

word2

a character string

n_grams

a value specifying the consecutive n-grams of the words

Value

a float number

Examples

library(textTinyR)

word1 = 'one_word'

word2 = 'two_words'

dts = dice_distance(word1, word2, n_grams = 2)

dimensions of a word vectors file

Description

dimensions of a word vectors file

Usage

dims_of_word_vecs(input_file = NULL, read_delimiter = "\n")

Arguments

input_file

a character string specifying a valid path to a text file

read_delimiter

a character string specifying the row delimiter of the text file

Details

This function takes a valid path to a file and a file delimiter as input and estimates the dimensions of the word vectors by using the first row of the file.

Value

a numeric value

Examples

library(textTinyR)

PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")

dimensions = dims_of_word_vecs(input_file = PATH)

Conversion of text documents to word-vector-representation features ( Doc2Vec )

Description

Conversion of text documents to word-vector-representation features ( Doc2Vec )


Usage

# utl <- Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL,

       #                    print_every_rows = 10000, verbose = FALSE,

       #                    copy_data = FALSE)

Details

the pre_processed_wv method should be used after the initialization of the Doc2Vec class, if the copy_data parameter is set to TRUE, in order to inspect the pre-processed word-vectors.

The global_term_weights method is part of the sparse_term_matrix R6 class of the textTinyR package. The correct global_term_weights can be obtained by using the sparse_term_matrix class and setting the tf_idf parameter to FALSE and the normalize parameter to NULL. In the Doc2Vec class, if the method equals idf then the global_term_weights parameter should not be NULL.

Explanation of the various methods :

sum_sqrt

Assuming that a single sublist of the token list is taken into consideration : the word-vectors of each word of the sublist of tokens will be accumulated to a vector equal in length to the word-vector (INITIAL_WORD_VECTOR). Then a scalar will be computed from this INITIAL_WORD_VECTOR in the following way : the INITIAL_WORD_VECTOR will be raised to the power of 2.0, the resulting vector will be summed and the square-root of the sum will be taken. The INITIAL_WORD_VECTOR will be divided by the resulting scalar

min_max_norm

Assuming that a single sublist of the token list is taken into consideration : the word-vectors of each word of the sublist of tokens will first be min-max normalized and then accumulated to a vector equal in length to the initial word-vector

idf

Assuming that a single sublist of the token list is taken into consideration : the word-vector of each term in the sublist will be multiplied by the corresponding idf of the global term weights

There might be slight differences in the output data for each method depending on the input value of the copy_data parameter (if it's either TRUE or FALSE).
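
As a minimal illustration (plain R, not part of the package) of the sum_sqrt method described above, assuming wv is a matrix whose rows are the word-vectors of the tokens of a single sublist :

wv = matrix(runif(12), nrow = 4, ncol = 3)        # 4 tokens, 3-dimensional word-vectors

initial_word_vector = colSums(wv)                 # accumulate the word-vectors of the sublist
scalar = sqrt(sum(initial_word_vector^2))         # raise to the power of 2.0, sum, take the square-root
doc_feature = initial_word_vector / scalar        # divide by the resulting scalar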

Value

a matrix

Methods

Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL, print_every_rows = 10000, verbose = FALSE, copy_data = FALSE)
--------------
doc2vec_methods(method = "sum_sqrt", global_term_weights = NULL, threads = 1)
--------------
pre_processed_wv()

Methods

Public methods


Method new()

Usage
Doc2Vec$new(
  token_list = NULL,
  word_vector_FILE = NULL,
  print_every_rows = 10000,
  verbose = FALSE,
  copy_data = FALSE
)
Arguments
token_list

either NULL or a list of tokenized text documents

word_vector_FILE

a valid path to a text file, where the word-vectors are saved

print_every_rows

a numeric value greater than 1 specifying the print intervals. Frequent output in the R session can slow down the function especially in case of big files.

verbose

either TRUE or FALSE. If TRUE then information will be printed out in the R session.

copy_data

either TRUE or FALSE. If FALSE then a pointer will be created and no copy of the initial data takes place (memory efficient especially for big datasets). This is an alternative way to pre-process the data.


Method doc2vec_methods()

Usage
Doc2Vec$doc2vec_methods(
  method = "sum_sqrt",
  global_term_weights = NULL,
  threads = 1
)
Arguments
method

a character string specifying the method to use. One of sum_sqrt, min_max_norm or idf. See the details section for more information.

global_term_weights

either NULL or the output of the global_term_weights method of the textTinyR package. See the details section for more information.

threads

a numeric value specifying the number of cores to run in parallel


Method pre_processed_wv()

Usage
Doc2Vec$pre_processed_wv()

Method clone()

The objects of this class are cloneable with this method.

Usage
Doc2Vec$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

library(textTinyR)

#---------------------------------
# tokenized text in form of a list
#---------------------------------

tok_text = list(c('the', 'result', 'of'), c('doc2vec', 'are', 'vector', 'features'))

#-------------------------
# path to the word vectors
#-------------------------

PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")


init = Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH)


out = init$doc2vec_methods(method = "sum_sqrt")
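
# Hedged sketch of the 'idf' workflow (see the Details section) : the global term weights
# would first be computed with the sparse_term_matrix class (tf_idf = FALSE, normalize = NULL)
# on the same corpus, assuming a text file 'docs.txt' with the documents exists :

# stm = sparse_term_matrix$new(file_data = "docs.txt", document_term_matrix = TRUE)
# tm = stm$Term_Matrix(split_string = TRUE, tf_idf = FALSE, normalize = NULL)
# gtw = stm$global_term_weights()

# out_idf = init$doc2vec_methods(method = "idf", global_term_weights = gtw)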

Jaccard or Dice similarity for text documents

Description

Jaccard or Dice similarity for text documents

Usage

JACCARD_DICE(
  token_list1 = NULL,
  token_list2 = NULL,
  method = "jaccard",
  threads = 1
)

Arguments

token_list1

a list of tokenized text documents (it should have the same length as the token_list2)

token_list2

a list of tokenized text documents (it should have the same length as the token_list1)

method

a character string specifying the similarity metric. One of 'jaccard', 'dice'

threads

a numeric value specifying the number of cores to run in parallel

Details

The function calculates either the jaccard or the dice distance between pairs of tokenized text of two lists

Value

a numeric vector

Examples

library(textTinyR)

lst1 = list(c('use', 'this', 'function', 'to'), c('either', 'compute', 'the', 'jaccard'))

lst2 = list(c('or', 'the', 'dice', 'distance'), c('for', 'two', 'same', 'sized', 'lists'))

out = JACCARD_DICE(token_list1 = lst1, token_list2 = lst2, method = 'jaccard', threads = 1)

levenshtein distance of two words

Description

levenshtein distance of two words

Usage

levenshtein_distance(word1, word2)

Arguments

word1

a character string

word2

a character string

Value

a float number

Examples

library(textTinyR)

word1 = 'one_word'

word2 = 'two_words'

lvs = levenshtein_distance(word1, word2)

load a sparse matrix in binary format

Description

load a sparse matrix in binary format

Usage

load_sparse_binary(file_name = "save_sparse.mat")

Arguments

file_name

a character string specifying the binary file

Value

loads a sparse matrix from a file

Examples

## Not run: 

library(textTinyR)

load_sparse_binary(file_name = "save_sparse.mat")


## End(Not run)

sparsity percentage of a sparse matrix

Description

sparsity percentage of a sparse matrix

Usage

matrix_sparsity(sparse_matrix)

Arguments

sparse_matrix

a sparse matrix

Value

a numeric value (percentage)

Examples

library(textTinyR)

tmp = matrix(sample(0:1, 100, replace = TRUE), 10, 10)

sp_mat = dense_2sparse(tmp)

dbl = matrix_sparsity(sp_mat)

read a specific number of characters from a text file

Description

read a specific number of characters from a text file

Usage

read_characters(input_file = NULL, characters = 100, write_2file = "")

Arguments

input_file

a character string specifying a valid path to a text file

characters

a numeric value specifying the number of characters to read

write_2file

either an empty string ("") or a character string specifying a valid output file to write the subset of the input file

Examples

## Not run: 

library(textTinyR)

txfl = read_characters(input_file = 'input.txt', characters = 100)


## End(Not run)

read a specific number of rows from a text file

Description

read a specific number of rows from a text file

Usage

read_rows(
  input_file = NULL,
  read_delimiter = "\n",
  rows = 100,
  write_2file = ""
)

Arguments

input_file

a character string specifying a valid path to a text file

read_delimiter

a character string specifying the row delimiter of the text file

rows

a numeric value specifying the number of rows to read

write_2file

either "" or a character string specifying a valid output file to write the subset of the input file

Examples

## Not run: 

library(textTinyR)

txfl = read_rows(input_file = 'input.txt', rows = 100)


## End(Not run)

save a sparse matrix in binary format

Description

save a sparse matrix in binary format

Usage

save_sparse_binary(sparse_matrix, file_name = "save_sparse.mat")

Arguments

sparse_matrix

a sparse matrix

file_name

a character string specifying the binary file

Value

writes the sparse matrix to a file

Examples

library(textTinyR)

tmp = matrix(sample(0:1, 100, replace = TRUE), 10, 10)

sp_mat = dense_2sparse(tmp)

# save_sparse_binary(sp_mat, file_name = "save_sparse.mat")

Exclude highly correlated predictors

Description

Exclude highly correlated predictors

Usage

select_predictors(
  response_vector,
  predictors_matrix,
  response_lower_thresh = 0.1,
  predictors_upper_thresh = 0.75,
  threads = 1,
  verbose = FALSE
)

Arguments

response_vector

a numeric vector (the length should be equal to the rows of the predictors_matrix parameter)

predictors_matrix

a numeric matrix (the rows should be equal to the length of the response_vector parameter)

response_lower_thresh

a numeric value. This parameter allows the user to keep all the predictors having a correlation with the response greater than the response_lower_thresh value.

predictors_upper_thresh

a numeric value. This parameter allows the user to keep all the predictors having a correlation with the other predictors of less than the predictors_upper_thresh value.

threads

a numeric value specifying the number of cores to run in parallel

verbose

either TRUE or FALSE. If TRUE then information will be printed out in the R session.

Details

The function works in the following way : The correlation of the predictors with the response is first calculated and the resulting correlations are sorted in decreasing order. Then predictors with a correlation higher than the predictors_upper_thresh value are removed iteratively, favoring those predictors which are more correlated with the response variable. If the response_lower_thresh value is greater than 0.0 then only predictors having a correlation higher than or equal to the response_lower_thresh value will be kept, otherwise they will be excluded. This function returns the indices of the predictors and is useful in case of multicollinearity.

If during computation the correlation between the response variable and a potential predictor is equal to NA or +/- Inf, then a correlation of 0.0 will be assigned to this particular pair.

Value

a vector of column-indices

Examples

library(textTinyR)

set.seed(1)
resp = runif(100)

set.seed(2)
col = runif(100)

matr = matrix(c(col, col^4, col^6, col^8, col^10), nrow = 100, ncol = 5)

out = select_predictors(resp, matr, predictors_upper_thresh = 0.75)
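
# 'out' holds the column indices of the retained predictors (see the Value section),
# so the reduced predictor matrix can be obtained, for instance, with :

matr_reduced = matr[, out, drop = FALSE]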

RowMeans and colMeans for a sparse matrix

Description

RowMeans and colMeans for a sparse matrix

Usage

sparse_Means(sparse_matrix, rowMeans = FALSE)

Arguments

sparse_matrix

a sparse matrix

rowMeans

either TRUE or FALSE. If TRUE then the row-means will be calculated, otherwise the column-means

Value

a vector with either the row- or the column-means of the matrix

Examples

library(textTinyR)

tmp = matrix(sample(0:1, 100, replace = TRUE), 10, 10)

sp_mat = dense_2sparse(tmp)

spsm = sparse_Means(sp_mat, rowMeans = FALSE)

RowSums and colSums for a sparse matrix

Description

RowSums and colSums for a sparse matrix

Usage

sparse_Sums(sparse_matrix, rowSums = FALSE)

Arguments

sparse_matrix

a sparse matrix

rowSums

either TRUE or FALSE. If TRUE then the row-sums will be calculated, otherwise the column-sums

Value

a vector with either the row- or the column-sums of the matrix

Examples

library(textTinyR)

tmp = matrix(sample(0:1, 100, replace = TRUE), 10, 10)

sp_mat = dense_2sparse(tmp)

spsm = sparse_Sums(sp_mat, rowSums = FALSE)

Term matrices and statistics ( document-term-matrix, term-document-matrix)

Description

Term matrices and statistics ( document-term-matrix, term-document-matrix)


Usage

# utl <- sparse_term_matrix$new(vector_data = NULL, file_data = NULL,

#                                      document_term_matrix = TRUE)

Details

the Term_Matrix function takes either a character vector of strings or a text file and after tokenization and transformation returns either a document-term-matrix or a term-document-matrix

the triplet_data function returns the triplet data, which is used internally (in C++), to construct the Term Matrix. The triplet data could be useful for secondary purposes, such as in word vector representations.

the global_term_weights function returns a list of length two. The first sublist includes the terms and the second sublist the global-term-weights. The tf_idf parameter should be set to FALSE and the normalize parameter to NULL. This function is normally used in conjunction with word-vector-embeddings.

the Term_Matrix_Adjust function removes sparse terms from a sparse matrix using a sparsity threshold

the term_associations function finds the associations between the given terms (Terms argument) and all the other terms in the corpus by calculating their correlation. There is also the option to keep a specific number of terms from the output table using the keep_terms parameter.

the most_frequent_terms function returns the most frequent terms of the corpus using the output of the sparse matrix. The user has the option to keep a specific number of terms from the output table using the keep_terms parameter.

Stemming of the english language is done using the porter2-stemmer, for details see https://github.com/smassung/porter2_stemmer

Methods

sparse_term_matrix$new(vector_data = NULL, file_data = NULL, document_term_matrix = TRUE)
--------------
Term_Matrix(sort_terms = FALSE, to_lower = FALSE, to_upper = FALSE, utf_locale = "", remove_char = "", remove_punctuation_string = FALSE, remove_punctuation_vector = FALSE, remove_numbers = FALSE, trim_token = FALSE, split_string = FALSE, split_separator = " \r\n\t.,;:()?!//", remove_stopwords = FALSE, language = "english", min_num_char = 1, max_num_char = Inf, stemmer = NULL, min_n_gram = 1, max_n_gram = 1, skip_n_gram = 1, skip_distance = 0, n_gram_delimiter = " ", print_every_rows = 1000, normalize = NULL, tf_idf = FALSE, threads = 1, verbose = FALSE)
--------------
triplet_data()
--------------
global_term_weights()
--------------
Term_Matrix_Adjust(sparsity_thresh = 1.0)
--------------
term_associations(Terms = NULL, keep_terms = NULL, verbose = FALSE)
--------------
most_frequent_terms(keep_terms = NULL, threads = 1, verbose = FALSE)

Methods

Public methods


Method new()

Usage
sparse_term_matrix$new(
  vector_data = NULL,
  file_data = NULL,
  document_term_matrix = TRUE
)
Arguments
vector_data

either NULL or a character vector of documents

file_data

either NULL or a valid character path to a text file

document_term_matrix

either TRUE or FALSE. If TRUE then a document-term-matrix will be returned, otherwise a term-document-matrix


Method Term_Matrix()

Usage
sparse_term_matrix$Term_Matrix(
  sort_terms = FALSE,
  to_lower = FALSE,
  to_upper = FALSE,
  utf_locale = "",
  remove_char = "",
  remove_punctuation_string = FALSE,
  remove_punctuation_vector = FALSE,
  remove_numbers = FALSE,
  trim_token = FALSE,
  split_string = FALSE,
  split_separator = " \r\n\t.,;:()?!//",
  remove_stopwords = FALSE,
  language = "english",
  min_num_char = 1,
  max_num_char = Inf,
  stemmer = NULL,
  min_n_gram = 1,
  max_n_gram = 1,
  skip_n_gram = 1,
  skip_distance = 0,
  n_gram_delimiter = " ",
  print_every_rows = 1000,
  normalize = NULL,
  tf_idf = FALSE,
  threads = 1,
  verbose = FALSE
)
Arguments
sort_terms

either TRUE or FALSE specifying if the initial terms should be sorted ( so that the output sparse matrix is sorted in alphabetical order )

to_lower

either TRUE or FALSE. If TRUE the character string will be converted to lower case

to_upper

either TRUE or FALSE. If TRUE the character string will be converted to upper case

utf_locale

the language specific locale to use in case that either the to_lower or the to_upper parameter is TRUE and the text file language is other than english. For instance if the language of a text file is greek then the utf_locale parameter should be 'el_GR.UTF-8' ( language_country.encoding ). A wrong utf-locale does not raise an error, however the runtime of the function increases.

remove_char

a string specifying the specific characters that should be removed from a text file. If the remove_char is "" then no removal of characters takes place

remove_punctuation_string

either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function)

remove_punctuation_vector

either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place)

remove_numbers

either TRUE or FALSE. If TRUE then any numbers in the character string will be removed

trim_token

either TRUE or FALSE. If TRUE then the string will be trimmed (left and/or right)

split_string

either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters.

split_separator

a character string specifying the character delimiter(s)

remove_stopwords

either TRUE, FALSE or a character vector of user defined stop words. If TRUE then by using the language parameter the corresponding stop words vector will be loaded.

language

a character string which defaults to english. If the remove_stopwords parameter is TRUE then the corresponding stop words vector will be loaded. Available languages are afrikaans, arabic, armenian, basque, bengali, breton, bulgarian, catalan, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hausa, hebrew, hindi, hungarian, indonesian, irish, italian, latvian, marathi, norwegian, persian, polish, portuguese, romanian, russian, slovak, slovenian, somalia, spanish, swahili, swedish, turkish, yoruba, zulu

min_num_char

an integer specifying the minimum number of characters to keep. If min_num_char is greater than 1 then only character strings with at least that many characters will be returned

max_num_char

an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this function the Inf value translates to a word-length of 1000000000)

stemmer

a character string specifying the stemming method. Available method is the porter2_stemmer. See details for more information.

min_n_gram

an integer specifying the minimum number of n-grams. The minimum number of min_n_gram is 1.

max_n_gram

an integer specifying the maximum number of n-grams. The minimum number of max_n_gram is 1.

skip_n_gram

an integer specifying the number of skip-n-grams. The minimum number of skip_n_gram is 1. The skip_n_gram gives the (max.) n-grams using the skip_distance parameter. If skip_n_gram is greater than 1 then both min_n_gram and max_n_gram should be set to 1.

skip_distance

an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned.

n_gram_delimiter

a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases)

print_every_rows

a numeric value greater than 1 specifying the print intervals. Frequent output in the R session can slow down the function in case of big files.

normalize

either NULL or one of 'l1' or 'l2' normalization.

tf_idf

either TRUE or FALSE. If TRUE then the term-frequency-inverse-document-frequency will be returned

threads

an integer specifying the number of cores to run in parallel

verbose

either TRUE or FALSE. If TRUE then information will be printed out


Method triplet_data()

Usage
sparse_term_matrix$triplet_data()

Method global_term_weights()

Usage
sparse_term_matrix$global_term_weights()

Method Term_Matrix_Adjust()

Usage
sparse_term_matrix$Term_Matrix_Adjust(sparsity_thresh = 1)
Arguments
sparsity_thresh

a float number between 0.0 and 1.0 specifying the sparsity threshold in the Term_Matrix_Adjust function


Method term_associations()

Usage
sparse_term_matrix$term_associations(
  Terms = NULL,
  keep_terms = NULL,
  verbose = FALSE
)
Arguments
Terms

a character vector specifying the character strings for which the associations will be calculated ( term_associations function )

keep_terms

either NULL or a numeric value specifying the number of terms to keep ( both in term_associations and most_frequent_terms functions )

verbose

either TRUE or FALSE. If TRUE then information will be printed out


Method most_frequent_terms()

Usage
sparse_term_matrix$most_frequent_terms(
  keep_terms = NULL,
  threads = 1,
  verbose = FALSE
)
Arguments
keep_terms

either NULL or a numeric value specifying the number of terms to keep ( both in term_associations and most_frequent_terms functions )

threads

an integer specifying the number of cores to run in parallel

verbose

either TRUE or FALSE. If TRUE then information will be printed out


Method clone()

The objects of this class are cloneable with this method.

Usage
sparse_term_matrix$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

## Not run: 

library(textTinyR)


sm <- sparse_term_matrix$new(file_data = "/folder/my_data.txt",
                              document_term_matrix = TRUE)

#--------------
# term matrix :
#--------------

sm$Term_Matrix(sort_terms = TRUE, to_lower = TRUE,
                trim_token = TRUE, split_string = TRUE,
                remove_stopwords = TRUE, normalize = 'l1',
                stemmer = 'porter2_stemmer', threads = 1 )

#---------------
# triplet data :
#---------------

sm$triplet_data()


#----------------------
# global-term-weights :
#----------------------

sm$global_term_weights()


#-------------------------
# removal of sparse terms:
#-------------------------

sm$Term_Matrix_Adjust(sparsity_thresh = 0.995)


#-----------------------------------------------
# associations between terms of a sparse matrix:
#-----------------------------------------------

sm$term_associations(Terms = c("word", "sentence"), keep_terms = 10)


#---------------------------------------------
# most frequent terms using the sparse matrix:
#---------------------------------------------

sm$most_frequent_terms(keep_terms = 10, threads = 1)


## End(Not run)

Dissimilarity calculation of text documents

Description

Dissimilarity calculation of text documents

Usage

TEXT_DOC_DISSIM(
  first_matr = NULL,
  second_matr = NULL,
  method = "euclidean",
  batches = NULL,
  threads = 1,
  verbose = FALSE
)

Arguments

first_matr

a numeric matrix where each row represents a text document ( has same dimensions as the second_matr )

second_matr

a numeric matrix where each row represents a text document ( has same dimensions as the first_matr )

method

a dissimilarity metric in form of a character string. One of euclidean, manhattan, chebyshev, canberra, braycurtis, pearson_correlation, cosine, simple_matching_coefficient, hamming, jaccard_coefficient, Rao_coefficient

batches

a numeric value specifying the number of batches

threads

a numeric value specifying the number of cores to run in parallel

verbose

either TRUE or FALSE. If TRUE then information will be printed in the console

Details

Row-wise dissimilarity calculation of text documents. The text document sequences should be converted to numeric matrices using for instance LSI (Latent Semantic Indexing). If the numeric matrices are too big to be pre-processed, then one should use the batches parameter to split the data in batches before applying one of the dissimilarity metrics. For parallelization (threads) OpenMP will be used.

Value

a numeric vector

Examples

## Not run: 

library(textTinyR)


# example input LSI matrices (see details section)
#-------------------------------------------------

set.seed(1)
LSI_matrix1 = matrix(runif(10000), 100, 100)

set.seed(2)
LSI_matrix2 = matrix(runif(10000), 100, 100)


txt_out = TEXT_DOC_DISSIM(first_matr = LSI_matrix1,
                          second_matr = LSI_matrix2,
                          method = 'euclidean')
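
# for big matrices the batches parameter (see the Details section) can additionally
# split the data, for instance :

# txt_out_batched = TEXT_DOC_DISSIM(first_matr = LSI_matrix1, second_matr = LSI_matrix2,
#                                   method = 'euclidean', batches = 5, threads = 1)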

## End(Not run)

text file parser

Description

text file parser

Usage

text_file_parser(
  input_path_file = NULL,
  output_path_file = "",
  start_query = NULL,
  end_query = NULL,
  min_lines = 1,
  trimmed_line = FALSE,
  verbose = FALSE
)

Arguments

input_path_file

either a path to an input file or a vector of character strings ( normally the latter would represent ordered lines of a text file in form of a character vector )

output_path_file

either an empty character string ("") or a character string specifying a path to an output file ( it applies only if the input_path_file parameter is a valid path to a file )

start_query

a character string or a vector of character strings. The start_query (if it's a single character string) is the first word of the subset of the data and should appear frequently at the beginning of each line in the text file.

end_query

a character string or a vector of character strings. The end_query (if it's a single character string) is the last word of the subset of the data and should appear frequently at the end of each line in the text file.

min_lines

a numeric value specifying the minimum number of lines ( applies only if the input_path_file is a valid path to a file ). For instance if min_lines = 2, then only subsets of text with more than 1 line will be pre-processed.

trimmed_line

either TRUE or FALSE. If FALSE then each line of the text file will be trimmed on both sides before applying the start_query and end_query

verbose

either TRUE or FALSE. If TRUE then information will be printed in the console

Details

The text file should have a structure (such as an xml-structure), so that subsets can be extracted using the start_query and end_query parameters ( the same applies in case of a vector of character strings)

Examples

## Not run: 

library(textTinyR)

# In case that the 'input_path_file' is a valid path
#---------------------------------------------------

fp = text_file_parser(input_path_file = '/folder/input_data.txt',
                       output_path_file = '/folder/output_data.txt',
                       start_query = 'word_a', end_query = 'word_w',
                       min_lines = 1, trimmed_line = FALSE)


# In case that the 'input_path_file' is a character vector of strings
#--------------------------------------------------------------------

PATH_url = "https://FILE.xml"
con = url(PATH_url, method = "libcurl")
tmp_dat = read.delim(con, quote = "\"", comment.char = "", stringsAsFactors = FALSE)

vec_docs = unlist(lapply(1:length(as.vector(tmp_dat[, 1])), function(x)
                    trimws(tmp_dat[x, 1], which = "both")))

parse_data = text_file_parser(input_path_file = vec_docs,
                                start_query = c("<query1>", "<query2>", "<query3>"),
                                end_query = c("</query1>", "</query2>", "</query3>"),
                                min_lines = 1, trimmed_line = TRUE)
                                

## End(Not run)

intersection of words or letters in tokenized text

Description

intersection of words or letters in tokenized text


Usage

# utl <- text_intersect$new(token_list1 = NULL, token_list2 = NULL)

Details

This class includes methods for text or character intersection. If both distinct and letters are FALSE then the simple (count or ratio) word intersection will be computed.

Value

a numeric vector

Methods

text_intersect$new(token_list1 = NULL, token_list2 = NULL)
--------------
count_intersect(distinct = FALSE, letters = FALSE)
--------------
ratio_intersect(distinct = FALSE, letters = FALSE)

Methods

Public methods


Method new()

Usage
text_intersect$new(token_list1 = NULL, token_list2 = NULL)
Arguments
token_list1

a list, where each sublist is a tokenized text sequence (token_list1 should be of same length with token_list2)

token_list2

a list, where each sublist is a tokenized text sequence (token_list2 should be of same length with token_list1)


Method count_intersect()

Usage
text_intersect$count_intersect(distinct = FALSE, letters = FALSE)
Arguments
distinct

either TRUE or FALSE. If TRUE then the intersection of distinct words (or letters) will be taken into account

letters

either TRUE or FALSE. If TRUE then the intersection of letters in the text sequences will be computed


Method ratio_intersect()

Usage
text_intersect$ratio_intersect(distinct = FALSE, letters = FALSE)
Arguments
distinct

either TRUE or FALSE. If TRUE then the intersection of distinct words (or letters) will be taken into account

letters

either TRUE or FALSE. If TRUE then the intersection of letters in the text sequences will be computed


Method clone()

The objects of this class are cloneable with this method.

Usage
text_intersect$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

https://www.kaggle.com/c/home-depot-product-search-relevance/discussion/20427 by Igor Buinyi

Examples

library(textTinyR)

tok1 = list(c('compare', 'this', 'text'),

            c('and', 'this', 'text'))

tok2 = list(c('with', 'another', 'set'),

            c('of', 'text', 'documents'))


init = text_intersect$new(tok1, tok2)


init$count_intersect(distinct = TRUE, letters = FALSE)


init$ratio_intersect(distinct = FALSE, letters = TRUE)

token statistics

Description

token statistics


Usage

# utl <- token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL,

#                               file_delimiter = "\n", n_gram_delimiter = "_")

Details

the path_2vector function returns the words of a folder or file as a vector ( using the file_delimiter to read the data ). A typical usage is to read a vocabulary from a text file

the freq_distribution function returns a named, unsorted vector of the frequency distribution for EITHER a folder, a file OR a character string vector. A specific subset of the result can be retrieved using the print_frequency function

the count_character function returns the number of characters for each word of the corpus for EITHER a folder, a file OR a character string vector. A specific number of character words can be retrieved using the print_count_character function

the collocation_words function returns a co-occurrence frequency table for n-grams for EITHER a folder, a file OR a character string vector. A collocation is defined as a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components ( http://nlp.stanford.edu/fsnlp/promo/colloc.pdf, page 172 ). The input to the function should be text n-grams separated by a delimiter (for instance 3- or 4-ngrams ). A specific frequency table can be retrieved using the print_collocations function

the string_dissimilarity_matrix function returns a string-dissimilarity-matrix using either the dice, levenshtein or cosine distance. The input can be a character string vector only. In case that the method is dice then the dice-coefficient (similarity) is calculated between two strings for a specific number of character n-grams ( dice_n_gram ).

the look_up_table returns a look-up-list where the list-names are the n-grams and the list-vectors are the words associated with those n-grams. The words for each n-gram can be retrieved using the print_words_lookup_tbl function. The input can be a character string vector only.

Methods

token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL, file_delimiter = "\n", n_gram_delimiter = "_")
--------------
path_2vector()
--------------
freq_distribution()
--------------
print_frequency(subset = NULL)
--------------
count_character()
--------------
print_count_character(number = NULL)
--------------
collocation_words()
--------------
print_collocations(word = NULL)
--------------
string_dissimilarity_matrix(dice_n_gram = 2, method = "dice", split_separator = " ", dice_thresh = 1.0, upper = TRUE, diagonal = TRUE, threads = 1)
--------------
look_up_table(n_grams = NULL)
--------------
print_words_lookup_tbl(n_gram = NULL)

Methods

Public methods


Method new()

Usage
token_stats$new(
  x_vec = NULL,
  path_2folder = NULL,
  path_2file = NULL,
  file_delimiter = "\n",
  n_gram_delimiter = "_"
)
Arguments
x_vec

either NULL or a string character vector

path_2folder

either NULL or a valid path to a folder (each file in the folder should include words separated by a delimiter)

path_2file

either NULL or a valid path to a file

file_delimiter

either NULL or a character string specifying the file delimiter

n_gram_delimiter

either NULL or a character string specifying the n-gram delimiter. It is used in the collocation_words function


Method path_2vector()

Usage
token_stats$path_2vector()

Method freq_distribution()

Usage
token_stats$freq_distribution()

Method print_frequency()

Usage
token_stats$print_frequency(subset = NULL)
Arguments
subset

either NULL or a vector specifying the subset of data to keep (number of rows of the print_frequency function)


Method count_character()

Usage
token_stats$count_character()

Method print_count_character()

Usage
token_stats$print_count_character(number = NULL)
Arguments
number

a numeric value for the print_count_character function. All words with number of characters equal to the number parameter will be returned.


Method collocation_words()

Usage
token_stats$collocation_words()

Method print_collocations()

Usage
token_stats$print_collocations(word = NULL)
Arguments
word

a character string for the print_collocations and print_prob_next functions


Method string_dissimilarity_matrix()

Usage
token_stats$string_dissimilarity_matrix(
  dice_n_gram = 2,
  method = "dice",
  split_separator = " ",
  dice_thresh = 1,
  upper = TRUE,
  diagonal = TRUE,
  threads = 1
)
Arguments
dice_n_gram

a numeric value specifying the n-gram for the dice method of the string_dissimilarity_matrix function

method

a character string specifying the method to use in the string_dissimilarity_matrix function. One of dice, levenshtein or cosine.

split_separator

a character string specifying the string split separator if the method equals cosine in the string_dissimilarity_matrix function. The cosine method uses sentences, so for a sentence : "this_is_a_word_sentence" the split_separator should be "_"

dice_thresh

a float number to use to threshold the data if method is dice in the string_dissimilarity_matrix function. It takes values between 0.0 and 1.0. The closer the thresh is to 0.0 the more values of the dissimilarity matrix will take the value of 1.0.

upper

either TRUE or FALSE. If TRUE then both lower and upper parts of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the upper part will be filled with NA's

diagonal

either TRUE or FALSE. If TRUE then the diagonal of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the diagonal will be filled with NA's

threads

a numeric value specifying the number of cores to use in parallel in the string_dissimilarity_matrix function


Method look_up_table()

Usage
token_stats$look_up_table(n_grams = NULL)
Arguments
n_grams

a numeric value specifying the n-grams in the look_up_table function


Method print_words_lookup_tbl()

Usage
token_stats$print_words_lookup_tbl(n_gram = NULL)
Arguments
n_gram

a character string specifying the n-gram to use in the print_words_lookup_tbl function


Method clone()

The objects of this class are cloneable with this method.

Usage
token_stats$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

library(textTinyR)

expl = c('one_word_token', 'two_words_token', 'three_words_token', 'four_words_token')

tk <- token_stats$new(x_vec = expl, path_2folder = NULL, path_2file = NULL)

#-------------------------
# frequency distribution:
#-------------------------

tk$freq_distribution()

# tk$print_frequency()


#------------------
# count characters:
#------------------

cnt <- tk$count_character()

# tk$print_count_character(number = 4)


#----------------------
# collocation of words:
#----------------------

col <- tk$collocation_words()

# tk$print_collocations(word = 'five')


#-----------------------------
# string dissimilarity matrix:
#-----------------------------

dism <- tk$string_dissimilarity_matrix(method = 'levenshtein')
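
# besides 'levenshtein', the method parameter also accepts 'dice' and 'cosine'
# (a sketch based on the documented arguments; for 'cosine' the split_separator
# must match the word delimiter of the input tokens, here '_'):

# dism_dice <- tk$string_dissimilarity_matrix(dice_n_gram = 2, method = 'dice')

# dism_cos <- tk$string_dissimilarity_matrix(method = 'cosine', split_separator = '_')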


#---------------------
# build a look-up-table:
#---------------------

lut <- tk$look_up_table(n_grams = 3)

# tk$print_words_lookup_tbl(n_gram = 'e_w')

String tokenization and transformation ( character string or path to a file )

Description

String tokenization and transformation ( character string or path to a file )

Usage

tokenize_transform_text(
  object = NULL,
  batches = NULL,
  read_file_delimiter = "\n",
  to_lower = FALSE,
  to_upper = FALSE,
  utf_locale = "",
  remove_char = "",
  remove_punctuation_string = FALSE,
  remove_punctuation_vector = FALSE,
  remove_numbers = FALSE,
  trim_token = FALSE,
  split_string = FALSE,
  split_separator = " \r\n\t.,;:()?!//",
  remove_stopwords = FALSE,
  language = "english",
  min_num_char = 1,
  max_num_char = Inf,
  stemmer = NULL,
  min_n_gram = 1,
  max_n_gram = 1,
  skip_n_gram = 1,
  skip_distance = 0,
  n_gram_delimiter = " ",
  concat_delimiter = NULL,
  path_2folder = "",
  stemmer_ngram = 4,
  stemmer_gamma = 0,
  stemmer_truncate = 3,
  stemmer_batches = 1,
  threads = 1,
  vocabulary_path_file = NULL,
  verbose = FALSE
)

Arguments

object

either a character string (text data) or a character-string-path to a file (for big .txt files it's recommended to use a path to a file).

batches

a numeric value. If the batches parameter is not NULL then the object parameter should be a valid path to a file and the path_2folder parameter should be a valid path to a folder. The batches parameter should be used for small to medium data sets (to minimize memory consumption); see also the commented batch example in the Examples section. For big data sets the big_tokenize_transform R6 class and especially the big_text_tokenizer function should be used.

read_file_delimiter

the delimiter to use when the input file is read (for instance a tab-delimiter or a new-line delimiter).

to_lower

either TRUE or FALSE. If TRUE the character string will be converted to lower case

to_upper

either TRUE or FALSE. If TRUE the character string will be converted to upper case

utf_locale

the language specific locale to use in case that either the to_lower or the to_upper parameter is TRUE and the text file language is other than english. For instance if the language of a text file is greek then the utf_locale parameter should be 'el_GR.UTF-8' ( language_country.encoding ). A wrong utf-locale does not raise an error, however the runtime of the function increases.

remove_char

a character string with specific characters that should be removed from the text file. If remove_char is "" then no removal of characters takes place

remove_punctuation_string

either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function)

remove_punctuation_vector

either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place)

remove_numbers

either TRUE or FALSE. If TRUE then any numbers in the character string will be removed

trim_token

either TRUE or FALSE. If TRUE then the string will be trimmed (left and/or right)

split_string

either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters.

split_separator

a character string specifying the character delimiter(s)

remove_stopwords

either TRUE, FALSE or a character vector of user defined stop words. If TRUE then by using the language parameter the corresponding stop words vector will be loaded.

language

a character string which defaults to english. If the remove_stopwords parameter is TRUE then the corresponding stop words vector will be loaded. Available languages are afrikaans, arabic, armenian, basque, bengali, breton, bulgarian, catalan, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hausa, hebrew, hindi, hungarian, indonesian, irish, italian, latvian, marathi, norwegian, persian, polish, portuguese, romanian, russian, slovak, slovenian, somalia, spanish, swahili, swedish, turkish, yoruba, zulu

min_num_char

an integer specifying the minimum number of characters to keep. If min_num_char is greater than 1, then only character strings with at least min_num_char characters will be kept

max_num_char

an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this function the Inf value translates to a word-length of 1000000000)

stemmer

a character string specifying the stemming method. One of the following: porter2_stemmer, ngram_sequential, ngram_overlap. See details for more information.

min_n_gram

an integer specifying the minimum number of n-grams. The minimum number of min_n_gram is 1.

max_n_gram

an integer specifying the maximum number of n-grams. The minimum number of max_n_gram is 1.

skip_n_gram

an integer specifying the number of skip-n-grams. The minimum number of skip_n_gram is 1. The skip_n_gram gives the (max.) n-grams using the skip_distance parameter. If skip_n_gram is greater than 1 then both min_n_gram and max_n_gram should be set to 1.

skip_distance

an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned.

n_gram_delimiter

a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases)

concat_delimiter

either NULL or a character string specifying the delimiter to use in order to concatenate the end-vector of character strings to a single character string (recommended in case that the end-vector should be saved to a file)

path_2folder

a character string specifying the path to the folder where the file(s) will be saved

stemmer_ngram

a numeric value greater than 1. Applies to both the ngram_sequential and ngram_overlap methods. In case of ngram_sequential the first n characters will be picked, whereas in the case of ngram_overlap the overlapping stemmer_ngram characters will be built.

stemmer_gamma

a float number greater than or equal to 0.0. Applies only to ngram_sequential. It is a threshold value which defines how much frequency deviation of two N-grams is acceptable. It should be kept either at zero or at a minimal value.

stemmer_truncate

a numeric value greater than 0. Applies only to ngram_sequential. The ngram_sequential method is modified to use relative frequencies (float numbers between 0.0 and 1.0 for the n-grams of a specific word in the corpus) and the stemmer_truncate parameter controls the number of rounding digits for the n-grams of the word. The main purpose is to give the same relative frequency to words appearing approximately the same number of times in the corpus.

stemmer_batches

a numeric value greater than 0. Applies only to ngram_sequential. Splits the corpus into batches with the option to run the batches in multiple threads.

threads

an integer specifying the number of cores to run in parallel

vocabulary_path_file

either NULL or a character string specifying the output path to a file where the vocabulary should be saved once the text is tokenized

verbose

either TRUE or FALSE. If TRUE then information will be printed out

Details

It is memory efficient to read the data using a path to a file in case of a big file, rather than importing the data into the R session and then calling the tokenize_transform_text function.

It is memory efficient to give a path_2folder in case that a big file should be saved, rather than returning the vector of all character strings to the R session.

The skip-grams are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over. They provide one way of overcoming the data sparsity problem found with conventional n-gram analysis.
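
As an illustration (a minimal sketch using the parameters documented above; the exact output depends on the input text), skip-n-grams can be requested by keeping min_n_gram and max_n_gram at 1 and setting skip_n_gram together with skip_distance:

res_skip = tokenize_transform_text(object = "the quick brown fox jumps",
                                   to_lower = TRUE, split_string = TRUE,
                                   min_n_gram = 1, max_n_gram = 1,
                                   skip_n_gram = 2, skip_distance = 1)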

Many character string pre-processing functions (such as the utf-locale or the split-string function ) are based on the boost library ( https://www.boost.org/ ).

Stemming of the english language is done using the porter2-stemmer, for details see https://github.com/smassung/porter2_stemmer

N-gram stemming is language independent and supported by the following two functions:

ngram_overlap

The ngram_overlap stemming method is based on N-Gram Morphemes for Retrieval, Paul McNamee and James Mayfield, http://clef.isti.cnr.it/2007/working_notes/mcnameeCLEF2007.pdf

ngram_sequential

The ngram_sequential stemming method is a modified version based on Generation, Implementation and Appraisal of an N-gram based Stemming Algorithm, B. P. Pande, Pawan Tamta, H. S. Dhami, https://arxiv.org/pdf/1312.4824.pdf

The list of stop-words in the available languages was downloaded from the following link, https://github.com/6/stopwords-json

Value

a character vector

Examples

library(textTinyR)

token_str = "CONVERT to lower, remove.. punctuation11234, trim token and split "

res = tokenize_transform_text(object = token_str, to_lower = TRUE, split_string = TRUE)
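
# a sketch of batch processing for a larger input file (hypothetical paths); when the
# batches parameter is used, the object must be a valid path to a file and the output
# is written to the path_2folder folder rather than returned to the R session:

# res_btch = tokenize_transform_text(object = '/folder/input_file.txt', batches = 4,
#                                    to_lower = TRUE, split_string = TRUE,
#                                    path_2folder = '/folder/output_batches/')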

String tokenization and transformation ( vector of documents )

Description

String tokenization and transformation ( vector of documents )

Usage

tokenize_transform_vec_docs(
  object = NULL,
  as_token = FALSE,
  to_lower = FALSE,
  to_upper = FALSE,
  utf_locale = "",
  remove_char = "",
  remove_punctuation_string = FALSE,
  remove_punctuation_vector = FALSE,
  remove_numbers = FALSE,
  trim_token = FALSE,
  split_string = FALSE,
  split_separator = " \r\n\t.,;:()?!//",
  remove_stopwords = FALSE,
  language = "english",
  min_num_char = 1,
  max_num_char = Inf,
  stemmer = NULL,
  min_n_gram = 1,
  max_n_gram = 1,
  skip_n_gram = 1,
  skip_distance = 0,
  n_gram_delimiter = " ",
  concat_delimiter = NULL,
  path_2folder = "",
  threads = 1,
  vocabulary_path_file = NULL,
  verbose = FALSE
)

Arguments

object

a character string vector of documents

as_token

if TRUE then the output of the function is a list of (split) tokens. Otherwise it is a vector of character strings (sentences)

to_lower

either TRUE or FALSE. If TRUE the character string will be converted to lower case

to_upper

either TRUE or FALSE. If TRUE the character string will be converted to upper case

utf_locale

the language specific locale to use in case that either the to_lower or the to_upper parameter is TRUE and the text file language is other than english. For instance if the language of a text file is greek then the utf_locale parameter should be 'el_GR.UTF-8' ( language_country.encoding ). A wrong utf-locale does not raise an error, however the runtime of the function increases.

remove_char

a character string with specific characters that should be removed from the text file. If remove_char is "" then no removal of characters takes place

remove_punctuation_string

either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function)

remove_punctuation_vector

either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place)

remove_numbers

either TRUE or FALSE. If TRUE then any numbers in the character string will be removed

trim_token

either TRUE or FALSE. If TRUE then the string will be trimmed (left and/or right)

split_string

either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters.

split_separator

a character string specifying the character delimiter(s)

remove_stopwords

either TRUE, FALSE or a character vector of user defined stop words. If TRUE then by using the language parameter the corresponding stop words vector will be loaded.

language

a character string which defaults to english. If the remove_stopwords parameter is TRUE then the corresponding stop words vector will be loaded. Available languages are afrikaans, arabic, armenian, basque, bengali, breton, bulgarian, catalan, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hausa, hebrew, hindi, hungarian, indonesian, irish, italian, latvian, marathi, norwegian, persian, polish, portuguese, romanian, russian, slovak, slovenian, somalia, spanish, swahili, swedish, turkish, yoruba, zulu

min_num_char

an integer specifying the minimum number of characters to keep. If min_num_char is greater than 1, then only character strings with at least min_num_char characters will be kept

max_num_char

an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this function the Inf value translates to a word-length of 1000000000)

stemmer

a character string specifying the stemming method. The available method is porter2_stemmer. See details for more information.

min_n_gram

an integer specifying the minimum number of n-grams. The minimum number of min_n_gram is 1.

max_n_gram

an integer specifying the maximum number of n-grams. The minimum number of max_n_gram is 1.

skip_n_gram

an integer specifying the number of skip-n-grams. The minimum number of skip_n_gram is 1. The skip_n_gram gives the (max.) n-grams using the skip_distance parameter. If skip_n_gram is greater than 1 then both min_n_gram and max_n_gram should be set to 1.

skip_distance

an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned.

n_gram_delimiter

a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases)

concat_delimiter

either NULL or a character string specifying the delimiter to use in order to concatenate the end-vector of character strings to a single character string (recommended in case that the end-vector should be saved to a file)

path_2folder

a character string specifying the path to the folder where the file(s) will be saved

threads

an integer specifying the number of cores to run in parallel

vocabulary_path_file

either NULL or a character string specifying the output path to a file where the vocabulary should be saved once the text is tokenized

verbose

either TRUE or FALSE. If TRUE then information will be printed out

Details

It is memory efficient to give a path_2folder in case that a big file should be saved, rather than returning the vector of all character strings to the R session.

The skip-grams are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over. They provide one way of overcoming the data sparsity problem found with conventional n-gram analysis.

Many character string pre-processing functions (such as the utf-locale or the split-string function ) are based on the boost library ( https://www.boost.org/ ).

Stemming of the english language is done using the porter2-stemmer, for details see https://github.com/smassung/porter2_stemmer

The list of stop-words in the available languages was downloaded from the following link, https://github.com/6/stopwords-json

Value

a character vector

Examples

library(textTinyR)

token_doc_vec = c("CONVERT to lower", "remove.. punctuation11234", "trim token and split ")

res = tokenize_transform_vec_docs(object = token_doc_vec, to_lower = TRUE, split_string = TRUE)
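
# with as_token = TRUE the result is a list of (split) tokens (one character vector
# per input document) rather than a vector of character strings:

res_tok = tokenize_transform_vec_docs(object = token_doc_vec, as_token = TRUE,
                                      to_lower = TRUE, split_string = TRUE)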

utf-locale for the available languages

Description

utf-locale for the available languages

Usage

utf_locale(language = "english")

Arguments

language

a character string specifying the language for which the utf-locale should be returned

Details

This is a limited list of language locales. The appropriate locale depends mostly on the text input.

Value

a utf locale

Examples

library(textTinyR)

utf_locale(language = "english")
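
# assuming greek is among the available languages, the returned locale (for instance
# 'el_GR.UTF-8', as mentioned in the utf_locale parameter documentation) can be passed
# to the utf_locale parameter of the tokenization functions:

# utf_locale(language = "greek")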

returns the vocabulary counts for small or medium sized files (xml and not only)

Description

returns the vocabulary counts for small or medium sized files (xml and not only)

Usage

vocabulary_parser(
  input_path_file = NULL,
  start_query = NULL,
  end_query = NULL,
  vocabulary_path_file = NULL,
  min_lines = 1,
  trimmed_line = FALSE,
  to_lower = FALSE,
  to_upper = FALSE,
  utf_locale = "",
  max_num_char = Inf,
  remove_char = "",
  remove_punctuation_string = FALSE,
  remove_punctuation_vector = FALSE,
  remove_numbers = FALSE,
  trim_token = FALSE,
  split_string = FALSE,
  split_separator = " \r\n\t.,;:()?!//",
  remove_stopwords = FALSE,
  language = "english",
  min_num_char = 1,
  stemmer = NULL,
  min_n_gram = 1,
  max_n_gram = 1,
  skip_n_gram = 1,
  skip_distance = 0,
  n_gram_delimiter = " ",
  threads = 1,
  verbose = FALSE
)

Arguments

input_path_file

a character string specifying a valid path to the input file

start_query

a character string. The start_query is the first word of the subset of the data and should appear frequently at the beginning of each line in the text file.

end_query

a character string. The end_query is the last word of the subset of the data and should appear frequently at the end of each line in the text file.

vocabulary_path_file

a character string specifying the output file where the vocabulary should be saved (after tokenization and transformation are applied).

min_lines

a numeric value specifying the minimum number of lines. For instance if min_lines = 2, then only subsets of text with 2 or more lines will be kept.

trimmed_line

either TRUE or FALSE. If FALSE then each line of the text file will be trimmed on both sides before applying the start_query and end_query

to_lower

either TRUE or FALSE. If TRUE the character string will be converted to lower case

to_upper

either TRUE or FALSE. If TRUE the character string will be converted to upper case

utf_locale

the language specific locale to use in case that either the to_lower or the to_upper parameter is TRUE and the text file language is other than english. For instance if the language of a text file is greek then the utf_locale parameter should be 'el_GR.UTF-8' ( language_country.encoding ). A wrong utf-locale does not raise an error, however the runtime of the function increases.

max_num_char

an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this function the Inf value translates to a word-length of 1000000000)

remove_char

a character string with specific characters that should be removed from the text file. If remove_char is "" then no removal of characters takes place

remove_punctuation_string

either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function)

remove_punctuation_vector

either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place)

remove_numbers

either TRUE or FALSE. If TRUE then any numbers in the character string will be removed

trim_token

either TRUE or FALSE. If TRUE then the string will be trimmed (left and/or right)

split_string

either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters.

split_separator

a character string specifying the character delimiter(s)

remove_stopwords

either TRUE, FALSE or a character vector of user defined stop words. If TRUE then by using the language parameter the corresponding stop words vector will be loaded.

language

a character string which defaults to english. If the remove_stopwords parameter is TRUE then the corresponding stop words vector will be loaded. Available languages are afrikaans, arabic, armenian, basque, bengali, breton, bulgarian, catalan, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hausa, hebrew, hindi, hungarian, indonesian, irish, italian, latvian, marathi, norwegian, persian, polish, portuguese, romanian, russian, slovak, slovenian, somalia, spanish, swahili, swedish, turkish, yoruba, zulu

min_num_char

an integer specifying the minimum number of characters to keep. If min_num_char is greater than 1, then only character strings with at least min_num_char characters will be kept

stemmer

a character string specifying the stemming method. The available method is porter2_stemmer. See details for more information.

min_n_gram

an integer specifying the minimum number of n-grams. The minimum number of min_n_gram is 1.

max_n_gram

an integer specifying the maximum number of n-grams. The minimum number of max_n_gram is 1.

skip_n_gram

an integer specifying the number of skip-n-grams. The minimum number of skip_n_gram is 1. The skip_n_gram gives the (max.) n-grams using the skip_distance parameter. If skip_n_gram is greater than 1 then both min_n_gram and max_n_gram should be set to 1.

skip_distance

an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned.

n_gram_delimiter

a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases)

threads

an integer specifying the number of cores to run in parallel

verbose

either TRUE or FALSE. If TRUE then information will be printed in the console

Details

The text file should have a structure (such as an xml-structure), so that subsets can be extracted using the start_query and end_query parameters

For big files the vocabulary_accumulator method of the big_tokenize_transform class is appropriate

Stemming of the english language is done using the porter2-stemmer, for details see https://github.com/smassung/porter2_stemmer

Examples

## Not run: 

library(textTinyR)

 vps = vocabulary_parser(input_path_file = '/folder/input_data.txt',
                         start_query = 'start_word', end_query = 'end_word',
                         vocabulary_path_file = '/folder/vocab.txt',
                         to_lower = TRUE, split_string = TRUE)
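
 # a variant (hypothetical paths) that additionally keeps only subsets with 2 or more
 # lines and removes english stop-words:

 # vps_filt = vocabulary_parser(input_path_file = '/folder/input_data.txt',
 #                              start_query = 'start_word', end_query = 'end_word',
 #                              vocabulary_path_file = '/folder/vocab_filtered.txt',
 #                              min_lines = 2, remove_stopwords = TRUE,
 #                              to_lower = TRUE, split_string = TRUE)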

## End(Not run)