Title: Text Processing for Small or Big Data Files
Description: It offers functions for splitting, parsing, tokenizing and creating a vocabulary for big text data files. Moreover, it includes functions for building a document-term matrix and for extracting information from it (term associations, most frequent terms). It also includes functions for calculating token statistics (collocations, look-up tables, string dissimilarities) and functions for working with sparse matrices. Lastly, it includes functions for word vector representations (e.g. 'GloVe', 'fasttext') and for the calculation of (pairwise) text document dissimilarities. The source code is based on 'C++11' and exported to R through the 'Rcpp', 'RcppArmadillo' and 'BH' packages.
Authors: Lampros Mouselimis [aut, cre]
Maintainer: Lampros Mouselimis <[email protected]>
License: GPL-3
Version: 1.1.8
Built: 2024-11-21 05:37:07 UTC
Source: https://github.com/mlampros/texttinyr
Compute batches
batch_compute(n_rows, n_batches)
n_rows
a numeric specifying the number of rows
n_batches
a numeric specifying the number of output batches
a list
library(textTinyR)

btch = batch_compute(n_rows = 1000, n_batches = 10)
String tokenization and transformation for big data sets
# utl <- big_tokenize_transform$new(verbose = FALSE)
the big_text_splitter function splits a text file into sub-text-files using either the batches parameter alone (big-text-splitter-bytes) or both the batches and the end_query parameters (big-text-splitter-query). The end_query parameter (if not NULL) should be a character string specifying a word that appears repeatedly at the end of each line in the text file.
the big_text_parser function parses text files from an input folder and saves the processed files to an output folder. The big_text_parser is appropriate for files with a structure from which subsets can be extracted using the start_query and end_query parameters.
the big_text_tokenizer function tokenizes and transforms the text files of a folder and saves the result to either a folder or a single file. There is also the option to save a frequency vocabulary of the transformed tokens to a file.
the vocabulary_accumulator function takes the resulting vocabulary files of the big_text_tokenizer and returns the vocabulary sums sorted in decreasing order. The max_num_chars parameter limits the words of the output vocabulary by the number of characters of each word.
The ngram_sequential and ngram_overlap stemming methods apply to each single batch and not to the whole corpus of the text file. Thus, it is possible that the stems of the same words differ across randomly selected batches.
big_tokenize_transform$new(verbose = FALSE)
--------------
big_text_splitter(input_path_file = NULL, output_path_folder = NULL, end_query = NULL, batches = NULL, trimmed_line = FALSE)
--------------
big_text_parser(input_path_folder = NULL, output_path_folder = NULL, start_query = NULL, end_query = NULL, min_lines = 1, trimmed_line = FALSE)
--------------
big_text_tokenizer(input_path_folder = NULL, batches = NULL, read_file_delimiter = "\n", to_lower = FALSE, to_upper = FALSE, utf_locale = "", remove_char = "", remove_punctuation_string = FALSE, remove_punctuation_vector = FALSE, remove_numbers = FALSE, trim_token = FALSE, split_string = FALSE, split_separator = " \r\n\t.,;:()?!//", remove_stopwords = FALSE, language = "english", min_num_char = 1, max_num_char = Inf, stemmer = NULL, min_n_gram = 1, max_n_gram = 1, skip_n_gram = 1, skip_distance = 0, n_gram_delimiter = " ", concat_delimiter = NULL, path_2folder = "", stemmer_ngram = 4, stemmer_gamma = 0.0, stemmer_truncate = 3, stemmer_batches = 1, threads = 1, save_2single_file = FALSE, increment_batch_nr = 1, vocabulary_path_folder = NULL)
--------------
vocabulary_accumulator(input_path_folder = NULL, vocabulary_path_file = NULL, max_num_chars = 100)
new()
big_tokenize_transform$new(verbose = FALSE)
verbose
either TRUE or FALSE. If TRUE then information will be printed in the console
big_text_splitter()
big_tokenize_transform$big_text_splitter( input_path_file = NULL, output_path_folder = NULL, end_query = NULL, batches = NULL, trimmed_line = FALSE )
input_path_file
a character string specifying the path to the input file
output_path_folder
a character string specifying the folder where the output files should be saved
end_query
a character string. The end_query is the last word of the subset of the data and should appear frequently at the end of each line in the text file.
batches
a numeric value specifying the number of batches to use. The batches will be used to split the initial data into subsets. Those subsets will be either saved in files (big_text_splitter function) or will be used internally for low memory processing (big_text_tokenizer function).
trimmed_line
either TRUE or FALSE. If FALSE then each line of the text file will be trimmed both sides before applying the start_query and end_query
big_text_parser()
big_tokenize_transform$big_text_parser( input_path_folder = NULL, output_path_folder = NULL, start_query = NULL, end_query = NULL, min_lines = 1, trimmed_line = FALSE )
input_path_folder
a character string specifying the folder where the input files are saved
output_path_folder
a character string specifying the folder where the output files should be saved
start_query
a character string. The start_query is the first word of the subset of the data and should appear frequently at the beginning of each line in the text file.
end_query
a character string. The end_query is the last word of the subset of the data and should appear frequently at the end of each line in the text file.
min_lines
a numeric value specifying the minimum number of lines. For instance if min_lines = 2, then only subsets of text with more than one line will be kept.
trimmed_line
either TRUE or FALSE. If FALSE then each line of the text file will be trimmed both sides before applying the start_query and end_query
big_text_tokenizer()
big_tokenize_transform$big_text_tokenizer( input_path_folder = NULL, batches = NULL, read_file_delimiter = "\n", to_lower = FALSE, to_upper = FALSE, utf_locale = "", remove_char = "", remove_punctuation_string = FALSE, remove_punctuation_vector = FALSE, remove_numbers = FALSE, trim_token = FALSE, split_string = FALSE, split_separator = " \r\n\t.,;:()?!//", remove_stopwords = FALSE, language = "english", min_num_char = 1, max_num_char = Inf, stemmer = NULL, min_n_gram = 1, max_n_gram = 1, skip_n_gram = 1, skip_distance = 0, n_gram_delimiter = " ", concat_delimiter = NULL, path_2folder = "", stemmer_ngram = 4, stemmer_gamma = 0, stemmer_truncate = 3, stemmer_batches = 1, threads = 1, save_2single_file = FALSE, increment_batch_nr = 1, vocabulary_path_folder = NULL )
input_path_folder
a character string specifying the folder where the input files are saved
batches
a numeric value specifying the number of batches to use. The batches will be used to split the initial data into subsets. Those subsets will be either saved in files (big_text_splitter function) or will be used internally for low memory processing (big_text_tokenizer function).
read_file_delimiter
the delimiter to use when the input file is read (for instance a tab-delimiter or a new-line delimiter).
to_lower
either TRUE or FALSE. If TRUE the character string will be converted to lower case
to_upper
either TRUE or FALSE. If TRUE the character string will be converted to upper case
utf_locale
the language specific locale to use in case that either the to_lower or the to_upper parameter is TRUE and the text file language is other than english. For instance if the language of a text file is greek then the utf_locale parameter should be 'el_GR.UTF-8' ( language_country.encoding ). A wrong utf-locale does not raise an error, however the runtime of the function increases.
remove_char
a character string with specific characters that should be removed from the text file. If the remove_char is "" then no removal of characters takes place
remove_punctuation_string
either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function)
remove_punctuation_vector
either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place)
remove_numbers
either TRUE or FALSE. If TRUE then any numbers in the character string will be removed
trim_token
either TRUE or FALSE. If TRUE then the string will be trimmed (left and/or right)
split_string
either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters.
split_separator
a character string specifying the character delimiter(s)
remove_stopwords
either TRUE, FALSE or a character vector of user defined stop words. If TRUE then by using the language parameter the corresponding stop words vector will be uploaded.
language
a character string which defaults to english. If the remove_stopwords parameter is TRUE then the corresponding stop words vector will be uploaded. Available languages are afrikaans, arabic, armenian, basque, bengali, breton, bulgarian, catalan, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hausa, hebrew, hindi, hungarian, indonesian, irish, italian, latvian, marathi, norwegian, persian, polish, portuguese, romanian, russian, slovak, slovenian, somalia, spanish, swahili, swedish, turkish, yoruba, zulu
min_num_char
an integer specifying the minimum number of characters to keep. If the min_num_char is greater than 1 then only character strings with more than 1 character will be returned
max_num_char
an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this function the Inf value translates to a word-length of 1000000000)
stemmer
a character string specifying the stemming method. One of the following porter2_stemmer, ngram_sequential, ngram_overlap. See details for more information.
min_n_gram
an integer specifying the minimum number of n-grams. The minimum number of min_n_gram is 1.
max_n_gram
an integer specifying the maximum number of n-grams. The minimum number of max_n_gram is 1.
skip_n_gram
an integer specifying the number of skip-n-grams. The minimum number of skip_n_gram is 1. The skip_n_gram gives the (max.) n-grams using the skip_distance parameter. If skip_n_gram is greater than 1 then both min_n_gram and max_n_gram should be set to 1.
skip_distance
an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned.
n_gram_delimiter
a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases)
concat_delimiter
either NULL or a character string specifying the delimiter to use in order to concatenate the end-vector of character strings to a single character string (recommended in case that the end-vector should be saved to a file)
path_2folder
a character string specifying the path to the folder where the file(s) will be saved
stemmer_ngram
a numeric value greater than 1. Applies to both the ngram_sequential and ngram_overlap methods. In the case of ngram_sequential the first stemmer_ngram characters will be picked, whereas in the case of ngram_overlap the overlapping stemmer_ngram characters will be built.
stemmer_gamma
a float number greater than or equal to 0.0. Applies only to ngram_sequential. It is a threshold value that defines how much frequency deviation between two n-grams is acceptable. It should be kept either at zero or at a minimum value.
stemmer_truncate
a numeric value greater than 0. Applies only to ngram_sequential. The ngram_sequential method is modified to use relative frequencies (float numbers between 0.0 and 1.0 for the n-grams of a specific word in the corpus) and the stemmer_truncate parameter controls the number of rounding digits for the n-grams of the word. The main purpose is to give the same relative frequency to words appearing approximately equally often in the corpus.
stemmer_batches
a numeric value greater than 0. Applies only to ngram_sequential. Splits the corpus into batches with the option to run the batches in multiple threads.
threads
an integer specifying the number of cores to run in parallel
save_2single_file
either TRUE or FALSE. If TRUE then the output data will be saved in a single file. Otherwise the data will be saved in multiple files with incremented enumeration
increment_batch_nr
a numeric value. The enumeration of the output files will start from the increment_batch_nr. If the save_2single_file parameter is TRUE then the increment_batch_nr parameter won't be taken into consideration.
vocabulary_path_folder
either NULL or a character string specifying the output folder where the vocabulary batches should be saved (after tokenization and transformation is applied). Applies to the big_text_tokenizer method.
vocabulary_accumulator()
big_tokenize_transform$vocabulary_accumulator( input_path_folder = NULL, vocabulary_path_file = NULL, max_num_chars = 100 )
input_path_folder
a character string specifying the folder where the input files are saved
vocabulary_path_file
either NULL or a character string specifying the output file where the vocabulary should be saved (after tokenization and transformation is applied). Applies to the vocabulary_accumulator method.
max_num_chars
a numeric value to limit the words of the output vocabulary to a maximum number of characters (applies to the vocabulary_accumulator function)
clone()
The objects of this class are cloneable with this method.
big_tokenize_transform$clone(deep = FALSE)
deep
Whether to make a deep clone.
## Not run: 

library(textTinyR)

fs <- big_tokenize_transform$new(verbose = FALSE)

#---------------
# file splitter:
#---------------

fs$big_text_splitter(input_path_file = "input.txt",
                     output_path_folder = "/folder/output/",
                     end_query = "endword", batches = 5,
                     trimmed_line = FALSE)

#-------------
# file parser:
#-------------

fs$big_text_parser(input_path_folder = "/folder/output/",
                   output_path_folder = "/folder/parser/",
                   start_query = "startword", end_query = "endword",
                   min_lines = 1, trimmed_line = TRUE)

#----------------
# file tokenizer:
#----------------

fs$big_text_tokenizer(input_path_folder = "/folder/parser/",
                      batches = 5, split_string = TRUE,
                      to_lower = TRUE, trim_token = TRUE,
                      max_num_char = 100, remove_stopwords = TRUE,
                      stemmer = "porter2_stemmer", threads = 1,
                      path_2folder = "/folder/output_token/",
                      vocabulary_path_folder = "/folder/VOCAB/")

#-------------------
# vocabulary counts:
#-------------------

fs$vocabulary_accumulator(input_path_folder = "/folder/VOCAB/",
                          vocabulary_path_file = "/folder/vocab.txt",
                          max_num_chars = 50)

## End(Not run)
bytes converter of a text file ( KB, MB or GB )
bytes_converter(input_path_file = NULL, unit = "MB")
input_path_file
a character string specifying the path to the input file
unit
a character string specifying the unit. One of KB, MB, GB
a number
## Not run: 

library(textTinyR)

bc = bytes_converter(input_path_file = 'some_file.txt', unit = "MB")

## End(Not run)
Frequencies of an existing cluster object
cluster_frequency(tokenized_list_text, cluster_vector, verbose = FALSE)
tokenized_list_text
a list of tokenized text documents. This can be the result of the textTinyR::tokenize_transform_vec_docs function with the as_token parameter set to TRUE (the token object of the output)
cluster_vector
a numeric vector. This can be the result of the ClusterR::KMeans_rcpp function (the clusters object of the output)
verbose
either TRUE or FALSE. If TRUE then information will be printed out in the R session.
This function takes a list of tokenized text and a numeric vector of clusters and returns the sorted frequency of each cluster. The length of the tokenized_list_text object must be equal to the length of the cluster_vector object
a list of data.tables
library(textTinyR)

tok_lst = list(c('the', 'the', 'tokens', 'of', 'first', 'document'),
               c('the', 'tokens', 'of', 'of', 'second', 'document'),
               c('the', 'tokens', 'of', 'third', 'third', 'document'))

vec_clust = rep(1:6, 3)

res = cluster_frequency(tok_lst, vec_clust)
Cosine similarity for text documents
COS_TEXT(text_vector1 = NULL, text_vector2 = NULL, threads = 1, separator = " ")
text_vector1
a character string vector representing text documents (it should have the same length as the text_vector2)
text_vector2
a character string vector representing text documents (it should have the same length as the text_vector1)
threads
a numeric value specifying the number of cores to run in parallel
separator
specifies the separator used between words of each character string in the text vectors
The function calculates the cosine distance between pairs of text sequences of two character string vectors
a numeric vector
library(textTinyR)

vec1 = c('use this', 'function to compute the')

vec2 = c('cosine distance', 'between text sequences')

out = COS_TEXT(text_vector1 = vec1, text_vector2 = vec2, separator = " ")
cosine distance of two character strings (each string consists of more than one word)
cosine_distance(sentence1, sentence2, split_separator = " ")
sentence1
a character string consisting of multiple words
sentence2
a character string consisting of multiple words
split_separator
a character string specifying the delimiter(s) to split the sentence
a float number
library(textTinyR)

sentence1 = 'this is one sentence'

sentence2 = 'this is a similar sentence'

cds = cosine_distance(sentence1, sentence2)
Number of rows of a file
Count_Rows(PATH, verbose = FALSE)
PATH
a character string specifying the path to a file
verbose
either TRUE or FALSE
This function returns the number of rows for a file. It doesn't load the data in memory.
a numeric value
library(textTinyR)

PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")

num_rows = Count_Rows(PATH)
convert a dense matrix to a sparse matrix
dense_2sparse(dense_mat)
dense_mat
a dense matrix
a sparse matrix
library(textTinyR)

tmp = matrix(sample(0:1, 100, replace = TRUE), 10, 10)

sp_mat = dense_2sparse(tmp)
dice similarity of words using n-grams
dice_distance(word1, word2, n_grams = 2)
word1
a character string
word2
a character string
n_grams
a value specifying the consecutive n-grams of the words
a float number
library(textTinyR)

word1 = 'one_word'

word2 = 'two_words'

dts = dice_distance(word1, word2, n_grams = 2)
dimensions of a word vectors file
dims_of_word_vecs(input_file = NULL, read_delimiter = "\n")
input_file
a character string specifying a valid path to a text file
read_delimiter
a character string specifying the row delimiter of the text file
This function takes a valid path to a file and a file delimiter as input and estimates the dimensions of the word vectors by using the first row of the file.
a numeric value
library(textTinyR)

PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")

dimensions = dims_of_word_vecs(input_file = PATH)
Conversion of text documents to word-vector-representation features ( Doc2Vec )
# utl <- Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL,
#                    print_every_rows = 10000, verbose = FALSE,
#                    copy_data = FALSE)
the pre_processed_wv method should be used after the initialization of the Doc2Vec class, if the copy_data parameter is set to TRUE, in order to inspect the pre-processed word-vectors.
The global_term_weights method is part of the sparse_term_matrix R6 class of the textTinyR package. One can obtain the correct global_term_weights by using the sparse_term_matrix class and by setting the tf_idf parameter to FALSE and the normalize parameter to NULL. In the Doc2Vec class, if method equals idf then the global_term_weights parameter should not be NULL.
Explanation of the various methods (see also the sketch after these descriptions) :
sum_sqrt : Assuming that a single sublist of the token list is taken into consideration, the word vectors of each word of the sublist of tokens are accumulated to a vector equal in length to the word vector (INITIAL_WORD_VECTOR). Then a scalar is computed from this INITIAL_WORD_VECTOR in the following way : the INITIAL_WORD_VECTOR is raised to the power of 2.0, the resulting vector is summed and the square root of the sum is taken. The INITIAL_WORD_VECTOR is divided by the resulting scalar.
min_max_norm : Assuming that a single sublist of the token list is taken into consideration, the word vectors of each word of the sublist of tokens are first min-max normalized and then accumulated to a vector equal in length to the initial word vector.
idf : Assuming that a single sublist of the token list is taken into consideration, the word vector of each term in the sublist is multiplied by the corresponding idf of the global term weights.
There might be slight differences in the output data for each method depending on the input value of the copy_data parameter (if it's either TRUE or FALSE).
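The following is a minimal sketch of the sum_sqrt accumulation for a single tokenized document, written in base R with a small hypothetical word-vector lookup (the toy vocabulary, the 3-dimensional vectors and the object names are illustrative and not part of the package API):

# hypothetical 3-dimensional word vectors for a toy vocabulary
word_vecs = rbind(the    = c(0.1, 0.3, 0.5),
                  result = c(0.2, 0.1, 0.4),
                  of     = c(0.7, 0.6, 0.2))

doc_tokens = c('the', 'result', 'of')

# accumulate the word vectors of the document (INITIAL_WORD_VECTOR)
initial_word_vector = colSums(word_vecs[doc_tokens, , drop = FALSE])

# scalar : square root of the sum of the squared entries
scalar = sqrt(sum(initial_word_vector ^ 2.0))

# the 'sum_sqrt' feature vector of the document
doc_feature = initial_word_vector / scalar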
a matrix
Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL, print_every_rows = 10000, verbose = FALSE, copy_data = FALSE)
--------------
doc2vec_methods(method = "sum_sqrt", global_term_weights = NULL, threads = 1)
--------------
pre_processed_wv()
new()
Doc2Vec$new( token_list = NULL, word_vector_FILE = NULL, print_every_rows = 10000, verbose = FALSE, copy_data = FALSE )
token_list
either NULL or a list of tokenized text documents
word_vector_FILE
a valid path to a text file, where the word-vectors are saved
print_every_rows
a numeric value greater than 1 specifying the print intervals. Frequent output in the R session can slow down the function especially in case of big files.
verbose
either TRUE or FALSE. If TRUE then information will be printed out in the R session.
copy_data
either TRUE or FALSE. If FALSE then a pointer will be created and no copy of the initial data takes place (memory efficient especially for big datasets). This is an alternative way to pre-process the data.
doc2vec_methods()
Doc2Vec$doc2vec_methods( method = "sum_sqrt", global_term_weights = NULL, threads = 1 )
method
a character string specifying the method to use. One of sum_sqrt, min_max_norm or idf. See the details section for more information.
global_term_weights
either NULL or the output of the global_term_weights method of the textTinyR package. See the details section for more information.
threads
a numeric value specifying the number of cores to run in parallel
pre_processed_wv()
Doc2Vec$pre_processed_wv()
clone()
The objects of this class are cloneable with this method.
Doc2Vec$clone(deep = FALSE)
deep
Whether to make a deep clone.
library(textTinyR)

#---------------------------------
# tokenized text in form of a list
#---------------------------------

tok_text = list(c('the', 'result', 'of'),
                c('doc2vec', 'are', 'vector', 'features'))

#-------------------------
# path to the word vectors
#-------------------------

PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")

init = Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH)

out = init$doc2vec_methods(method = "sum_sqrt")
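The details section notes that the idf method requires the global_term_weights output of the sparse_term_matrix class (computed with tf_idf = FALSE and normalize = NULL). The lines below are a hedged sketch of that workflow, reusing tok_text and PATH from the example above and assuming a hypothetical corpus file "/folder/my_data.txt":

## Not run: 

# global term weights of the corpus (tf_idf = FALSE, normalize = NULL)
sm = sparse_term_matrix$new(file_data = "/folder/my_data.txt",
                            document_term_matrix = TRUE)

sm$Term_Matrix(sort_terms = TRUE, to_lower = TRUE, split_string = TRUE,
               tf_idf = FALSE, normalize = NULL, threads = 1)

gtw = sm$global_term_weights()

# idf-weighted Doc2Vec features
init_idf = Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH)

out_idf = init_idf$doc2vec_methods(method = "idf", global_term_weights = gtw)

## End(Not run)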
Jaccard or Dice similarity for text documents
JACCARD_DICE(token_list1 = NULL, token_list2 = NULL, method = "jaccard", threads = 1)
token_list1
a list of tokenized text documents (it should have the same length as the token_list2)
token_list2
a list of tokenized text documents (it should have the same length as the token_list1)
method
a character string specifying the similarity metric. One of 'jaccard', 'dice'
threads
a numeric value specifying the number of cores to run in parallel
The function calculates either the jaccard or the dice distance between pairs of tokenized text of two lists
a numeric vector
library(textTinyR)

lst1 = list(c('use', 'this', 'function', 'to'),
            c('either', 'compute', 'the', 'jaccard'))

lst2 = list(c('or', 'the', 'dice', 'distance'),
            c('for', 'two', 'same', 'sized', 'lists'))

out = JACCARD_DICE(token_list1 = lst1, token_list2 = lst2, method = 'jaccard', threads = 1)
levenshtein distance of two words
levenshtein_distance(word1, word2)
word1
a character string
word2
a character string
a float number
library(textTinyR)

word1 = 'one_word'

word2 = 'two_words'

lvs = levenshtein_distance(word1, word2)
load a sparse matrix in binary format
load_sparse_binary(file_name = "save_sparse.mat")
file_name
a character string specifying the binary file
loads a sparse matrix from a file
## Not run: 

library(textTinyR)

load_sparse_binary(file_name = "save_sparse.mat")

## End(Not run)
sparsity percentage of a sparse matrix
matrix_sparsity(sparse_matrix)
sparse_matrix
a sparse matrix
a numeric value (percentage)
library(textTinyR)

tmp = matrix(sample(0:1, 100, replace = TRUE), 10, 10)

sp_mat = dense_2sparse(tmp)

dbl = matrix_sparsity(sp_mat)
read a specific number of characters from a text file
read_characters(input_file = NULL, characters = 100, write_2file = "")
input_file
a character string specifying a valid path to a text file
characters
a numeric value specifying the number of characters to read
write_2file
either an empty string ("") or a character string specifying a valid output file to write the subset of the input file
## Not run: 

library(textTinyR)

txfl = read_characters(input_file = 'input.txt', characters = 100)

## End(Not run)
read a specific number of rows from a text file
read_rows(input_file = NULL, read_delimiter = "\n", rows = 100, write_2file = "")
input_file
a character string specifying a valid path to a text file
read_delimiter
a character string specifying the row delimiter of the text file
rows
a numeric value specifying the number of rows to read
write_2file
either an empty string ("") or a character string specifying a valid output file to write the subset of the input file
## Not run: 

library(textTinyR)

txfl = read_rows(input_file = 'input.txt', rows = 100)

## End(Not run)
save a sparse matrix in binary format
save_sparse_binary(sparse_matrix, file_name = "save_sparse.mat")
sparse_matrix
a sparse matrix
file_name
a character string specifying the binary file
writes the sparse matrix to a file
library(textTinyR)

tmp = matrix(sample(0:1, 100, replace = TRUE), 10, 10)

sp_mat = dense_2sparse(tmp)

# save_sparse_binary(sp_mat, file_name = "save_sparse.mat")
Exclude highly correlated predictors
select_predictors(response_vector, predictors_matrix, response_lower_thresh = 0.1, predictors_upper_thresh = 0.75, threads = 1, verbose = FALSE)
response_vector
a numeric vector (the length should be equal to the number of rows of the predictors_matrix parameter)
predictors_matrix
a numeric matrix (the number of rows should be equal to the length of the response_vector parameter)
response_lower_thresh
a numeric value. This parameter allows the user to keep all the predictors having a correlation with the response greater than the response_lower_thresh value.
predictors_upper_thresh
a numeric value. This parameter allows the user to keep all the predictors having a correlation with the other predictors of less than the predictors_upper_thresh value.
threads
a numeric value specifying the number of cores to run in parallel
verbose
either TRUE or FALSE. If TRUE then information will be printed out in the R session.
The function works in the following way : the correlation of the predictors with the response is first calculated and the resulting correlations are sorted in decreasing order. Then, iteratively, predictors with a correlation higher than the predictors_upper_thresh value are removed, favoring those predictors which are more correlated with the response variable. If the response_lower_thresh value is greater than 0.0 then only predictors having a correlation with the response higher than or equal to the response_lower_thresh value will be kept, otherwise they will be excluded. This function returns the indices of the predictors and is useful in case of multicollinearity.
If during computation the correlation between the response variable and a potential predictor is equal to NA or +/- Inf, then a correlation of 0.0 will be assigned to this particular pair.
a vector of column-indices
library(textTinyR)

set.seed(1)
resp = runif(100)

set.seed(2)
col = runif(100)

matr = matrix(c(col, col^4, col^6, col^8, col^10), nrow = 100, ncol = 5)

out = select_predictors(resp, matr, predictors_upper_thresh = 0.75)
RowMeans and colMeans for a sparse matrix
sparse_Means(sparse_matrix, rowMeans = FALSE)
sparse_matrix
a sparse matrix
rowMeans
either TRUE or FALSE. If TRUE then the row-means will be calculated, otherwise the column-means
a vector with either the row- or the column-means of the matrix
library(textTinyR)

tmp = matrix(sample(0:1, 100, replace = TRUE), 10, 10)

sp_mat = dense_2sparse(tmp)

spsm = sparse_Means(sp_mat, rowMeans = FALSE)
RowSums and colSums for a sparse matrix
sparse_Sums(sparse_matrix, rowSums = FALSE)
sparse_matrix
a sparse matrix
rowSums
either TRUE or FALSE. If TRUE then the row-sums will be calculated, otherwise the column-sums
a vector with either the row- or the column-sums of the matrix
library(textTinyR)

tmp = matrix(sample(0:1, 100, replace = TRUE), 10, 10)

sp_mat = dense_2sparse(tmp)

spsm = sparse_Sums(sp_mat, rowSums = FALSE)
Term matrices and statistics ( document-term-matrix, term-document-matrix)
# utl <- sparse_term_matrix$new(vector_data = NULL, file_data = NULL,
#                               document_term_matrix = TRUE)
the Term_Matrix function takes either a character vector of strings or a text file and after tokenization and transformation returns either a document-term-matrix or a term-document-matrix
the triplet_data function returns the triplet data, which is used internally (in C++) to construct the Term Matrix. The triplet data could be useful for secondary purposes, such as in word vector representations.
the global_term_weights function returns a list of length two. The first sublist includes the terms and the second sublist the global-term-weights. The tf_idf parameter should be set to FALSE and the normalize parameter to NULL. This function is normally used in conjunction with word-vector embeddings.
the Term_Matrix_Adjust function removes sparse terms from a sparse matrix using a sparsity threshold
the term_associations function finds the associations between the given terms (Terms argument) and all the other terms in the corpus by calculating their correlation. There is also the option to keep a specific number of terms from the output table using the keep_terms parameter.
the most_frequent_terms function returns the most frequent terms of the corpus using the output of the sparse matrix. The user has the option to keep a specific number of terms from the output table using the keep_terms parameter.
Stemming of the english language is done using the porter2-stemmer, for details see https://github.com/smassung/porter2_stemmer
sparse_term_matrix$new(vector_data = NULL, file_data = NULL, document_term_matrix = TRUE)
--------------
Term_Matrix(sort_terms = FALSE, to_lower = FALSE, to_upper = FALSE, utf_locale = "", remove_char = "", remove_punctuation_string = FALSE, remove_punctuation_vector = FALSE, remove_numbers = FALSE, trim_token = FALSE, split_string = FALSE, split_separator = " \r\n\t.,;:()?!//", remove_stopwords = FALSE, language = "english", min_num_char = 1, max_num_char = Inf, stemmer = NULL, min_n_gram = 1, max_n_gram = 1, skip_n_gram = 1, skip_distance = 0, n_gram_delimiter = " ", print_every_rows = 1000, normalize = NULL, tf_idf = FALSE, threads = 1, verbose = FALSE)
--------------
triplet_data()
--------------
global_term_weights()
--------------
Term_Matrix_Adjust(sparsity_thresh = 1.0)
--------------
term_associations(Terms = NULL, keep_terms = NULL, verbose = FALSE)
--------------
most_frequent_terms(keep_terms = NULL, threads = 1, verbose = FALSE)
new()
sparse_term_matrix$new( vector_data = NULL, file_data = NULL, document_term_matrix = TRUE )
vector_data
either NULL or a character vector of documents
file_data
either NULL or a valid character path to a text file
document_term_matrix
either TRUE or FALSE. If TRUE then a document-term-matrix will be returned, otherwise a term-document-matrix
Term_Matrix()
sparse_term_matrix$Term_Matrix( sort_terms = FALSE, to_lower = FALSE, to_upper = FALSE, utf_locale = "", remove_char = "", remove_punctuation_string = FALSE, remove_punctuation_vector = FALSE, remove_numbers = FALSE, trim_token = FALSE, split_string = FALSE, split_separator = " \r\n\t.,;:()?!//", remove_stopwords = FALSE, language = "english", min_num_char = 1, max_num_char = Inf, stemmer = NULL, min_n_gram = 1, max_n_gram = 1, skip_n_gram = 1, skip_distance = 0, n_gram_delimiter = " ", print_every_rows = 1000, normalize = NULL, tf_idf = FALSE, threads = 1, verbose = FALSE )
sort_terms
either TRUE or FALSE specifying if the initial terms should be sorted ( so that the output sparse matrix is sorted in alphabetical order )
to_lower
either TRUE or FALSE. If TRUE the character string will be converted to lower case
to_upper
either TRUE or FALSE. If TRUE the character string will be converted to upper case
utf_locale
the language specific locale to use in case that either the to_lower or the to_upper parameter is TRUE and the text file language is other than english. For instance if the language of a text file is greek then the utf_locale parameter should be 'el_GR.UTF-8' ( language_country.encoding ). A wrong utf-locale does not raise an error, however the runtime of the function increases.
remove_char
a string specifying the specific characters that should be removed from a text file. If the remove_char is "" then no removal of characters takes place
remove_punctuation_string
either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function)
remove_punctuation_vector
either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place)
remove_numbers
either TRUE or FALSE. If TRUE then any numbers in the character string will be removed
trim_token
either TRUE or FALSE. If TRUE then the string will be trimmed (left and/or right)
split_string
either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters.
split_separator
a character string specifying the character delimiter(s)
remove_stopwords
either TRUE, FALSE or a character vector of user defined stop words. If TRUE then by using the language parameter the corresponding stop words vector will be uploaded.
language
a character string which defaults to english. If the remove_stopwords parameter is TRUE then the corresponding stop words vector will be uploaded. Available languages are afrikaans, arabic, armenian, basque, bengali, breton, bulgarian, catalan, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hausa, hebrew, hindi, hungarian, indonesian, irish, italian, latvian, marathi, norwegian, persian, polish, portuguese, romanian, russian, slovak, slovenian, somalia, spanish, swahili, swedish, turkish, yoruba, zulu
min_num_char
an integer specifying the minimum number of characters to keep. If the min_num_char is greater than 1 then only character strings with more than 1 character will be returned
max_num_char
an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this function the Inf value translates to a word-length of 1000000000)
stemmer
a character string specifying the stemming method. Available method is the porter2_stemmer. See details for more information.
min_n_gram
an integer specifying the minimum number of n-grams. The minimum number of min_n_gram is 1.
max_n_gram
an integer specifying the maximum number of n-grams. The minimum number of max_n_gram is 1.
skip_n_gram
an integer specifying the number of skip-n-grams. The minimum number of skip_n_gram is 1. The skip_n_gram gives the (max.) n-grams using the skip_distance parameter. If skip_n_gram is greater than 1 then both min_n_gram and max_n_gram should be set to 1.
skip_distance
an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned.
n_gram_delimiter
a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases)
print_every_rows
a numeric value greater than 1 specifying the print intervals. Frequent output in the R session can slow down the function in case of big files.
normalize
either NULL or one of 'l1' or 'l2' normalization.
tf_idf
either TRUE or FALSE. If TRUE then the term-frequency-inverse-document-frequency will be returned
threads
an integer specifying the number of cores to run in parallel
verbose
either TRUE or FALSE. If TRUE then information will be printed out
triplet_data()
sparse_term_matrix$triplet_data()
global_term_weights()
sparse_term_matrix$global_term_weights()
Term_Matrix_Adjust()
sparse_term_matrix$Term_Matrix_Adjust(sparsity_thresh = 1)
sparsity_thresh
a float number between 0.0 and 1.0 specifying the sparsity threshold in the Term_Matrix_Adjust function
term_associations()
sparse_term_matrix$term_associations( Terms = NULL, keep_terms = NULL, verbose = FALSE )
Terms
a character vector specifying the character strings for which the associations will be calculated ( term_associations function )
keep_terms
either NULL or a numeric value specifying the number of terms to keep ( both in term_associations and most_frequent_terms functions )
verbose
either TRUE or FALSE. If TRUE then information will be printed out
most_frequent_terms()
sparse_term_matrix$most_frequent_terms( keep_terms = NULL, threads = 1, verbose = FALSE )
keep_terms
either NULL or a numeric value specifying the number of terms to keep ( both in term_associations and most_frequent_terms functions )
threads
an integer specifying the number of cores to run in parallel
verbose
either TRUE or FALSE. If TRUE then information will be printed out
clone()
The objects of this class are cloneable with this method.
sparse_term_matrix$clone(deep = FALSE)
deep
Whether to make a deep clone.
## Not run: 

library(textTinyR)

sm <- sparse_term_matrix$new(file_data = "/folder/my_data.txt",
                             document_term_matrix = TRUE)

#--------------
# term matrix :
#--------------

sm$Term_Matrix(sort_terms = TRUE, to_lower = TRUE, trim_token = TRUE,
               split_string = TRUE, remove_stopwords = TRUE,
               normalize = 'l1', stemmer = 'porter2_stemmer', threads = 1)

#---------------
# triplet data :
#---------------

sm$triplet_data()

#----------------------
# global-term-weights :
#----------------------

sm$global_term_weights()

#-------------------------
# removal of sparse terms:
#-------------------------

sm$Term_Matrix_Adjust(sparsity_thresh = 0.995)

#-----------------------------------------------
# associations between terms of a sparse matrix:
#-----------------------------------------------

sm$term_associations(Terms = c("word", "sentence"), keep_terms = 10)

#---------------------------------------------
# most frequent terms using the sparse matrix:
#---------------------------------------------

sm$most_frequent_terms(keep_terms = 10, threads = 1)

## End(Not run)
Dissimilarity calculation of text documents
TEXT_DOC_DISSIM(first_matr = NULL, second_matr = NULL, method = "euclidean", batches = NULL, threads = 1, verbose = FALSE)
first_matr
a numeric matrix where each row represents a text document ( has same dimensions as the second_matr )
second_matr
a numeric matrix where each row represents a text document ( has same dimensions as the first_matr )
method
a dissimilarity metric in form of a character string. One of euclidean, manhattan, chebyshev, canberra, braycurtis, pearson_correlation, cosine, simple_matching_coefficient, hamming, jaccard_coefficient, Rao_coefficient
batches
a numeric value specifying the number of batches
threads
a numeric value specifying the number of cores to run in parallel
verbose
either TRUE or FALSE. If TRUE then information will be printed in the console
Row-wise dissimilarity calculation of text documents. The text document sequences should be converted to numeric matrices using for instance LSI (Latent Semantic Indexing). If the numeric matrices are too big to be pre-processed, then one should use the batches parameter to split the data in batches before applying one of the dissimilarity metrics. For parallelization (threads) OpenMP will be used.
a numeric vector
## Not run: 

library(textTinyR)

# example input LSI matrices (see details section)
#-------------------------------------------------

set.seed(1)
LSI_matrix1 = matrix(runif(10000), 100, 100)

set.seed(2)
LSI_matrix2 = matrix(runif(10000), 100, 100)

txt_out = TEXT_DOC_DISSIM(first_matr = LSI_matrix1,
                          second_matr = LSI_matrix2, 'euclidean')

## End(Not run)
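The details section suggests the batches parameter when the input matrices are too big to be processed at once. The lines below are a minimal sketch of that option, reusing the toy LSI matrices from the example above and assuming that 4 batches fit the available memory:

## Not run: 

# process the row pairs in 4 batches to reduce the peak memory usage
txt_out_batched = TEXT_DOC_DISSIM(first_matr = LSI_matrix1,
                                  second_matr = LSI_matrix2,
                                  method = 'euclidean',
                                  batches = 4, threads = 1)

## End(Not run)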
text file parser
text_file_parser(input_path_file = NULL, output_path_file = "", start_query = NULL, end_query = NULL, min_lines = 1, trimmed_line = FALSE, verbose = FALSE)
input_path_file
either a path to an input file or a vector of character strings (normally the latter would represent the ordered lines of a text file in form of a character vector)
output_path_file
either an empty character string ("") or a character string specifying a path to an output file (it applies only if the input_path_file parameter is a valid path to a file)
start_query
a character string or a vector of character strings. The start_query (if it's a single character string) is the first word of the subset of the data and should appear frequently at the beginning of each line in the text file.
end_query
a character string or a vector of character strings. The end_query (if it's a single character string) is the last word of the subset of the data and should appear frequently at the end of each line in the text file.
min_lines
a numeric value specifying the minimum number of lines (applies only if the input_path_file is a valid path to a file). For instance if min_lines = 2, then only subsets of text with more than one line will be pre-processed.
trimmed_line
either TRUE or FALSE. If FALSE then each line of the text file will be trimmed both sides before applying the start_query and end_query
verbose
either TRUE or FALSE. If TRUE then information will be printed in the console
The text file should have a structure (such as an xml-structure), so that subsets can be extracted using the start_query and end_query parameters ( the same applies in case of a vector of character strings)
## Not run: 

library(textTinyR)

# In case that the 'input_path_file' is a valid path
#---------------------------------------------------

fp = text_file_parser(input_path_file = '/folder/input_data.txt',
                      output_path_file = '/folder/output_data.txt',
                      start_query = 'word_a', end_query = 'word_w',
                      min_lines = 1, trimmed_line = FALSE)

# In case that the 'input_path_file' is a character vector of strings
#--------------------------------------------------------------------

PATH_url = "https://FILE.xml"

con = url(PATH_url, method = "libcurl")

tmp_dat = read.delim(con, quote = "\"", comment.char = "", stringsAsFactors = FALSE)

vec_docs = unlist(lapply(1:length(as.vector(tmp_dat[, 1])),
                         function(x) trimws(tmp_dat[x, 1], which = "both")))

parse_data = text_file_parser(input_path_file = vec_docs,
                              start_query = c("<query1>", "<query2>", "<query3>"),
                              end_query = c("</query1>", "</query2>", "</query3>"),
                              min_lines = 1, trimmed_line = TRUE)

## End(Not run)
intersection of words or letters in tokenized text
# utl <- text_intersect$new(token_list1 = NULL, token_list2 = NULL)
This class includes methods for text or character intersection. If both distinct and letters are FALSE then the simple (count or ratio) word intersection will be computed.
a numeric vector
text_intersect$new(token_list1 = NULL, token_list2 = NULL)
--------------
count_intersect(distinct = FALSE, letters = FALSE)
--------------
ratio_intersect(distinct = FALSE, letters = FALSE)
new()
text_intersect$new(token_list1 = NULL, token_list2 = NULL)
token_list1
a list, where each sublist is a tokenized text sequence (token_list1 should be of same length with token_list2)
token_list2
a list, where each sublist is a tokenized text sequence (token_list2 should be of same length with token_list1)
count_intersect()
text_intersect$count_intersect(distinct = FALSE, letters = FALSE)
distinct
either TRUE or FALSE. If TRUE then the intersection of distinct words (or letters) will be taken into account
letters
either TRUE or FALSE. If TRUE then the intersection of letters in the text sequences will be computed
ratio_intersect()
text_intersect$ratio_intersect(distinct = FALSE, letters = FALSE)
distinct
either TRUE or FALSE. If TRUE then the intersection of distinct words (or letters) will be taken into account
letters
either TRUE or FALSE. If TRUE then the intersection of letters in the text sequences will be computed
clone()
The objects of this class are cloneable with this method.
text_intersect$clone(deep = FALSE)
deep
Whether to make a deep clone.
https://www.kaggle.com/c/home-depot-product-search-relevance/discussion/20427 by Igor Buinyi
library(textTinyR)

tok1 = list(c('compare', 'this', 'text'),
            c('and', 'this', 'text'))

tok2 = list(c('with', 'another', 'set'),
            c('of', 'text', 'documents'))

init = text_intersect$new(tok1, tok2)

init$count_intersect(distinct = TRUE, letters = FALSE)

init$ratio_intersect(distinct = FALSE, letters = TRUE)
token statistics
# utl <- token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL,
#                        file_delimiter = ' ', n_gram_delimiter = "_")
the path_2vector function returns the words of a folder or file as a vector (using the file_delimiter to read the data). A typical usage is to read a vocabulary from a text file.
the freq_distribution function returns a named, unsorted frequency-distribution vector for EITHER a folder, a file OR a character string vector. A specific subset of the result can be retrieved using the print_frequency function (see the usage sketch after these descriptions).
the count_character function returns the number of characters for each word of the corpus for EITHER a folder, a file OR a character string vector. Words with a specific number of characters can be retrieved using the print_count_character function.
the collocation_words function returns a co-occurrence frequency table for n-grams for EITHER a folder, a file OR a character string vector. A collocation is defined as a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components ( http://nlp.stanford.edu/fsnlp/promo/colloc.pdf, page 172 ). The input to the function should be text n-grams separated by a delimiter (for instance 3- or 4-grams). A specific frequency table can be retrieved using the print_collocations function.
the string_dissimilarity_matrix function returns a string-dissimilarity-matrix using either the dice, levenshtein or cosine distance. The input can be a character string vector only. In case that the method is dice, the dice-coefficient (similarity) is calculated between two strings for a specific number of character n-grams ( dice_n_gram ).
the look_up_table function returns a look-up-list where the list names are the n-grams and the list vectors are the words associated with those n-grams. The words for each n-gram can be retrieved using the print_words_lookup_tbl function. The input can be a character string vector only.
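The following is a minimal usage sketch for a character string vector input, assuming the small toy vector below (the vector and its values are purely illustrative):

library(textTinyR)

# toy character vector; a folder or a file could be used instead (path_2folder / path_2file)
vec_toks = c('the', 'term', 'tokens', 'the', 'term', 'the')

tk = token_stats$new(x_vec = vec_toks, path_2folder = NULL, path_2file = NULL)

tk$freq_distribution()                   # named, unsorted frequency-distribution vector

tk$print_frequency()                     # print (a subset of) the frequency table

tk$count_character()                     # number of characters of each word

tk$print_count_character(number = 3)     # words consisting of exactly 3 characters

# pairwise string dissimilarities for the distinct words
tk2 = token_stats$new(x_vec = unique(vec_toks), path_2folder = NULL, path_2file = NULL)

tk2$string_dissimilarity_matrix(dice_n_gram = 2, method = "dice")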
token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL, file_delimiter = ' ', n_gram_delimiter = "_")
--------------
path_2vector()
--------------
freq_distribution()
--------------
print_frequency(subset = NULL)
--------------
count_character()
--------------
print_count_character(number = NULL)
--------------
collocation_words()
--------------
print_collocations(word = NULL)
--------------
string_dissimilarity_matrix(dice_n_gram = 2, method = "dice", split_separator = " ", dice_thresh = 1.0, upper = TRUE, diagonal = TRUE, threads = 1)
--------------
look_up_table(n_grams = NULL)
--------------
print_words_lookup_tbl(n_gram = NULL)
new()
token_stats$new( x_vec = NULL, path_2folder = NULL, path_2file = NULL, file_delimiter = "\n", n_gram_delimiter = "_" )
x_vec
either NULL or a string character vector
path_2folder
either NULL or a valid path to a folder (each file in the folder should include words separated by a delimiter)
path_2file
either NULL or a valid path to a file
file_delimiter
either NULL or a character string specifying the file delimiter
n_gram_delimiter
either NULL or a character string specifying the n-gram delimiter. It is used in the collocation_words function
path_2vector()
token_stats$path_2vector()
freq_distribution()
token_stats$freq_distribution()
print_frequency()
token_stats$print_frequency(subset = NULL)
subset
either NULL or a vector specifying the subset of data to keep (number of rows of the print_frequency function)
count_character()
token_stats$count_character()
print_count_character()
token_stats$print_count_character(number = NULL)
number
a numeric value for the print_count_character function. All words with number of characters equal to the number parameter will be returned.
collocation_words()
token_stats$collocation_words()
print_collocations()
token_stats$print_collocations(word = NULL)
word
a character string for the print_collocations and print_prob_next functions
string_dissimilarity_matrix()
token_stats$string_dissimilarity_matrix( dice_n_gram = 2, method = "dice", split_separator = " ", dice_thresh = 1, upper = TRUE, diagonal = TRUE, threads = 1 )
dice_n_gram
a numeric value specifying the n-gram for the dice method of the string_dissimilarity_matrix function
method
a character string specifying the method to use in the string_dissimilarity_matrix function. One of dice, levenshtein or cosine.
split_separator
a character string specifying the string split separator if the method is cosine in the string_dissimilarity_matrix function. The cosine method expects sentences, so for a sentence such as "this_is_a_word_sentence" the split_separator should be "_"
dice_thresh
a float number to use to threshold the data if the method is dice in the string_dissimilarity_matrix function. It takes values between 0.0 and 1.0. The closer the threshold is to 0.0, the more values of the dissimilarity matrix will take the value of 1.0.
upper
either TRUE or FALSE. If TRUE then both lower and upper parts of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the upper part will be filled with NA's
diagonal
either TRUE or FALSE. If TRUE then the diagonal of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the diagonal will be filled with NA's
threads
a numeric value specifying the number of cores to use in parallel in the string_dissimilarity_matrix function
look_up_table()
token_stats$look_up_table(n_grams = NULL)
n_grams
a numeric value specifying the n-grams in the look_up_table function
print_words_lookup_tbl()
token_stats$print_words_lookup_tbl(n_gram = NULL)
n_gram
a character string specifying the n-gram to use in the print_words_lookup_tbl function
clone()
The objects of this class are cloneable with this method.
token_stats$clone(deep = FALSE)
deep
Whether to make a deep clone.
library(textTinyR)

expl = c('one_word_token', 'two_words_token', 'three_words_token', 'four_words_token')

tk <- token_stats$new(x_vec = expl, path_2folder = NULL, path_2file = NULL)

#-------------------------
# frequency distribution:
#-------------------------

tk$freq_distribution()

# tk$print_frequency()

#------------------
# count characters:
#------------------

cnt <- tk$count_character()

# tk$print_count_character(number = 4)

#----------------------
# collocation of words:
#----------------------

col <- tk$collocation_words()

# tk$print_collocations(word = 'five')

#-----------------------------
# string dissimilarity matrix:
#-----------------------------

dism <- tk$string_dissimilarity_matrix(method = 'levenshtein')

#------------------------
# build a look-up-table:
#------------------------

lut <- tk$look_up_table(n_grams = 3)

# tk$print_words_lookup_tbl(n_gram = 'e_w')
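As a complementary sketch, the code below shows two workflows not covered by the example above: reading a vocabulary from a file with path_2vector (the file path is a hypothetical placeholder) and computing a dice-based dissimilarity matrix; the dice_n_gram and dice_thresh values are illustrative, not recommendations.

library(textTinyR)

# hypothetical path to a file that holds words separated by the file_delimiter
# tk_file <- token_stats$new(path_2file = "/folder/vocab.txt", file_delimiter = "\n")
# vocab <- tk_file$path_2vector()

# dice-based dissimilarity matrix on an in-memory character vector
tk_dice <- token_stats$new(x_vec = c('word', 'words', 'wording', 'worst'))

dsm_dice <- tk_dice$string_dissimilarity_matrix(dice_n_gram = 2, method = 'dice',
                                                dice_thresh = 1.0)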
String tokenization and transformation ( character string or path to a file )
tokenize_transform_text(
  object = NULL, batches = NULL, read_file_delimiter = "\n",
  to_lower = FALSE, to_upper = FALSE, utf_locale = "", remove_char = "",
  remove_punctuation_string = FALSE, remove_punctuation_vector = FALSE,
  remove_numbers = FALSE, trim_token = FALSE, split_string = FALSE,
  split_separator = " \r\n\t.,;:()?!//", remove_stopwords = FALSE,
  language = "english", min_num_char = 1, max_num_char = Inf,
  stemmer = NULL, min_n_gram = 1, max_n_gram = 1, skip_n_gram = 1,
  skip_distance = 0, n_gram_delimiter = " ", concat_delimiter = NULL,
  path_2folder = "", stemmer_ngram = 4, stemmer_gamma = 0,
  stemmer_truncate = 3, stemmer_batches = 1, threads = 1,
  vocabulary_path_file = NULL, verbose = FALSE
)
object |
either a character string (text data) or a character-string-path to a file (for big .txt files it's recommended to use a path to a file). |
batches |
a numeric value. If the batches parameter is not NULL then the object parameter should be a valid path to a file and the path_2folder parameter should be a valid path to a folder. The batches parameter should be used in case of small to medium data sets (for zero memory consumption). For big data sets the big_tokenize_transform R6 class and especially the big_text_tokenizer function should be used. |
read_file_delimiter |
the delimiter to use when the input file will be read (for instance a tab-delimiter or a new-line delimiter). |
to_lower |
either TRUE or FALSE. If TRUE the character string will be converted to lower case |
to_upper |
either TRUE or FALSE. If TRUE the character string will be converted to upper case |
utf_locale |
the language specific locale to use in case that either the to_lower or the to_upper parameter is TRUE and the text file language is other than english. For instance if the language of a text file is greek then the utf_locale parameter should be 'el_GR.UTF-8' ( language_country.encoding ). A wrong utf-locale does not raise an error, however the runtime of the function increases. |
remove_char |
a character string with specific characters that should be removed from the text file. If the remove_char is "" then no removal of characters takes place |
remove_punctuation_string |
either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function) |
remove_punctuation_vector |
either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place) |
remove_numbers |
either TRUE or FALSE. If TRUE then any numbers in the character string will be removed |
trim_token |
either TRUE or FALSE. If TRUE then the string will be trimmed (left and/or right) |
split_string |
either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters. |
split_separator |
a character string specifying the character delimiter(s) |
remove_stopwords |
either TRUE, FALSE or a character vector of user defined stop words. If TRUE then by using the language parameter the corresponding stop words vector will be uploaded. |
language |
a character string which defaults to english. If the remove_stopwords parameter is TRUE then the corresponding stop words vector will be uploaded. Available languages are afrikaans, arabic, armenian, basque, bengali, breton, bulgarian, catalan, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hausa, hebrew, hindi, hungarian, indonesian, irish, italian, latvian, marathi, norwegian, persian, polish, portuguese, romanian, russian, slovak, slovenian, somalia, spanish, swahili, swedish, turkish, yoruba, zulu |
min_num_char |
an integer specifying the minimum number of characters to keep. If min_num_char is greater than 1, then only character strings with at least that many characters will be returned |
max_num_char |
an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this function the Inf value translates to a word-length of 1000000000) |
stemmer |
a character string specifying the stemming method. One of the following porter2_stemmer, ngram_sequential, ngram_overlap. See details for more information. |
min_n_gram |
an integer specifying the minimum number of n-grams. The minimum number of min_n_gram is 1. |
max_n_gram |
an integer specifying the maximum number of n-grams. The minimum number of max_n_gram is 1. |
skip_n_gram |
an integer specifying the number of skip-n-grams. The minimum number of skip_n_gram is 1. The skip_n_gram gives the (max.) n-grams using the skip_distance parameter. If skip_n_gram is greater than 1 then both min_n_gram and max_n_gram should be set to 1. |
skip_distance |
an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned. |
n_gram_delimiter |
a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases) |
concat_delimiter |
either NULL or a character string specifying the delimiter to use in order to concatenate the end-vector of character strings to a single character string (recommended in case that the end-vector should be saved to a file) |
path_2folder |
a character string specifying the path to the folder where the file(s) will be saved |
stemmer_ngram |
a numeric value greater than 1. Applies to both ngram_sequential and ngram_overlap methods. In the case of ngram_sequential the first n characters will be picked, whereas in the case of ngram_overlap the overlapping stemmer_ngram characters will be built. |
stemmer_gamma |
a float number greater than or equal to 0.0. Applies only to ngram_sequential. It is a threshold value that defines how much frequency deviation between two N-grams is acceptable. It should be kept either at zero or at a minimum value. |
stemmer_truncate |
a numeric value greater than 0. Applies only to ngram_sequential. The ngram_sequential method is modified to use relative frequencies (float numbers between 0.0 and 1.0 for the ngrams of a specific word in the corpus) and the stemmer_truncate parameter controls the number of rounding digits for the ngrams of the word. The main purpose is to give the same relative frequency to words that appear approximately equally often in the corpus. |
stemmer_batches |
a numeric value greater than 0. Applies only to ngram_sequential. Splits the corpus into batches with the option to run the batches in multiple threads. |
threads |
an integer specifying the number of cores to run in parallel |
vocabulary_path_file |
either NULL or a character string specifying the output path to a file where the vocabulary should be saved once the text is tokenized |
verbose |
either TRUE or FALSE. If TRUE then information will be printed out |
It is memory efficient to read the data using a path file in case of a big file, rather than importing the data in the R-session and then calling the tokenize_transform_text function.
It is memory efficient to give a path_2folder in case that a big file should be saved, rather than return the vector of all character strings in the R-session.
The skip-grams are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over. They provide one way of overcoming the data sparsity problem found with conventional n-gram analysis.
Many character string pre-processing functions (such as the utf-locale or the split-string function ) are based on the boost library ( https://www.boost.org/ ).
Stemming of the english language is done using the porter2-stemmer, for details see https://github.com/smassung/porter2_stemmer
N-gram stemming is language independent and is supported by the following two methods:
The ngram_overlap stemming method is based on N-Gram Morphemes for Retrieval, Paul McNamee and James Mayfield, http://clef.isti.cnr.it/2007/working_notes/mcnameeCLEF2007.pdf
The ngram_sequential stemming method is a modified version based on Generation, Implementation and Appraisal of an N-gram based Stemming Algorithm, B. P. Pande, Pawan Tamta, H. S. Dhami, https://arxiv.org/pdf/1312.4824.pdf
The list of stop-words in the available languages was downloaded from the following link, https://github.com/6/stopwords-json
a character vector
library(textTinyR)

token_str = "CONVERT to lower, remove.. punctuation11234, trim token and split  "

res = tokenize_transform_text(object = token_str, to_lower = TRUE,
                              split_string = TRUE)
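To make the skip-gram behaviour concrete, the sketch below (with illustrative, non-prescriptive parameter values) requests skip-n-grams with a skip distance of 1; as noted above, min_n_gram and max_n_gram are kept at 1 when skip_n_gram is greater than 1.

library(textTinyR)

skip_str = "the quick brown fox jumps over the lazy dog"

res_skip = tokenize_transform_text(object = skip_str, to_lower = TRUE,
                                   split_string = TRUE, min_n_gram = 1,
                                   max_n_gram = 1, skip_n_gram = 2,
                                   skip_distance = 1)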
String tokenization and transformation ( vector of documents )
tokenize_transform_vec_docs(
  object = NULL, as_token = FALSE, to_lower = FALSE, to_upper = FALSE,
  utf_locale = "", remove_char = "", remove_punctuation_string = FALSE,
  remove_punctuation_vector = FALSE, remove_numbers = FALSE,
  trim_token = FALSE, split_string = FALSE,
  split_separator = " \r\n\t.,;:()?!//", remove_stopwords = FALSE,
  language = "english", min_num_char = 1, max_num_char = Inf,
  stemmer = NULL, min_n_gram = 1, max_n_gram = 1, skip_n_gram = 1,
  skip_distance = 0, n_gram_delimiter = " ", concat_delimiter = NULL,
  path_2folder = "", threads = 1, vocabulary_path_file = NULL,
  verbose = FALSE
)
object |
a character string vector of documents |
as_token |
if TRUE then the output of the function is a list of (split) tokens. Otherwise it is a vector of character strings (sentences) |
to_lower |
either TRUE or FALSE. If TRUE the character string will be converted to lower case |
to_upper |
either TRUE or FALSE. If TRUE the character string will be converted to upper case |
utf_locale |
the language specific locale to use in case that either the to_lower or the to_upper parameter is TRUE and the text file language is other than english. For instance if the language of a text file is greek then the utf_locale parameter should be 'el_GR.UTF-8' ( language_country.encoding ). A wrong utf-locale does not raise an error, however the runtime of the function increases. |
remove_char |
a character string with specific characters that should be removed from the text file. If the remove_char is "" then no removal of characters takes place |
remove_punctuation_string |
either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function) |
remove_punctuation_vector |
either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place) |
remove_numbers |
either TRUE or FALSE. If TRUE then any numbers in the character string will be removed |
trim_token |
either TRUE or FALSE. If TRUE then the string will be trimmed (left and/or right) |
split_string |
either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters. |
split_separator |
a character string specifying the character delimiter(s) |
remove_stopwords |
either TRUE, FALSE or a character vector of user defined stop words. If TRUE then by using the language parameter the corresponding stop words vector will be uploaded. |
language |
a character string which defaults to english. If the remove_stopwords parameter is TRUE then the corresponding stop words vector will be uploaded. Available languages are afrikaans, arabic, armenian, basque, bengali, breton, bulgarian, catalan, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hausa, hebrew, hindi, hungarian, indonesian, irish, italian, latvian, marathi, norwegian, persian, polish, portuguese, romanian, russian, slovak, slovenian, somalia, spanish, swahili, swedish, turkish, yoruba, zulu |
min_num_char |
an integer specifying the minimum number of characters to keep. If min_num_char is greater than 1, then only character strings with at least that many characters will be returned |
max_num_char |
an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this function the Inf value translates to a word-length of 1000000000) |
stemmer |
a character string specifying the stemming method. Available method is the porter2_stemmer. See details for more information. |
min_n_gram |
an integer specifying the minimum number of n-grams. The minimum number of min_n_gram is 1. |
max_n_gram |
an integer specifying the maximum number of n-grams. The minimum number of max_n_gram is 1. |
skip_n_gram |
an integer specifying the number of skip-n-grams. The minimum number of skip_n_gram is 1. The skip_n_gram gives the (max.) n-grams using the skip_distance parameter. If skip_n_gram is greater than 1 then both min_n_gram and max_n_gram should be set to 1. |
skip_distance |
an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned. |
n_gram_delimiter |
a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases) |
concat_delimiter |
either NULL or a character string specifying the delimiter to use in order to concatenate the end-vector of character strings to a single character string (recommended in case that the end-vector should be saved to a file) |
path_2folder |
a character string specifying the path to the folder where the file(s) will be saved |
threads |
an integer specifying the number of cores to run in parallel |
vocabulary_path_file |
either NULL or a character string specifying the output path to a file where the vocabulary should be saved once the text is tokenized |
verbose |
either TRUE or FALSE. If TRUE then information will be printed out |
It is memory efficient to give a path_2folder in case that a big file should be saved, rather than return the vector of all character strings in the R-session.
The skip-grams are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over. They provide one way of overcoming the data sparsity problem found with conventional n-gram analysis.
Many character string pre-processing functions (such as the utf-locale or the split-string function ) are based on the boost library ( https://www.boost.org/ ).
Stemming of the english language is done using the porter2-stemmer, for details see https://github.com/smassung/porter2_stemmer
The list of stop-words in the available languages was downloaded from the following link, https://github.com/6/stopwords-json
a character vector
library(textTinyR)

token_doc_vec = c("CONVERT to lower", "remove.. punctuation11234",
                  "trim token and split  ")

res = tokenize_transform_vec_docs(object = token_doc_vec, to_lower = TRUE,
                                  split_string = TRUE)
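Setting as_token = TRUE returns a list of split tokens (one vector per document) instead of a vector of character strings; the sketch below reuses the illustrative input above, and the additional cleaning flags are only examples.

library(textTinyR)

token_doc_vec = c("CONVERT to lower", "remove.. punctuation11234",
                  "trim token and split  ")

res_lst = tokenize_transform_vec_docs(object = token_doc_vec, as_token = TRUE,
                                      to_lower = TRUE,
                                      remove_punctuation_vector = TRUE,
                                      remove_numbers = TRUE, trim_token = TRUE,
                                      split_string = TRUE)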
utf-locale for the available languages
utf_locale(language = "english")
language |
a character string specifying the language for which the utf-locale should be returned |
This is a limited list of language locales. The appropriate locale depends mostly on the input text.
a utf locale
library(textTinyR)

utf_locale(language = "english")
returns the vocabulary counts for small or medium-sized files (xml, among other formats)
vocabulary_parser(
  input_path_file = NULL, start_query = NULL, end_query = NULL,
  vocabulary_path_file = NULL, min_lines = 1, trimmed_line = FALSE,
  to_lower = FALSE, to_upper = FALSE, utf_locale = "", max_num_char = Inf,
  remove_char = "", remove_punctuation_string = FALSE,
  remove_punctuation_vector = FALSE, remove_numbers = FALSE,
  trim_token = FALSE, split_string = FALSE,
  split_separator = " \r\n\t.,;:()?!//", remove_stopwords = FALSE,
  language = "english", min_num_char = 1, stemmer = NULL, min_n_gram = 1,
  max_n_gram = 1, skip_n_gram = 1, skip_distance = 0, n_gram_delimiter = " ",
  threads = 1, verbose = FALSE
)
input_path_file |
a character string specifying a valid path to the input file |
start_query |
a character string. The start_query is the first word of the subset of the data and should appear frequently at the beginning of each line in the text file. |
end_query |
a character string. The end_query is the last word of the subset of the data and should appear frequently at the end of each line in the text file. |
vocabulary_path_file |
a character string specifying the output file where the vocabulary should be saved (after tokenization and transformation is applied). |
min_lines |
a numeric value specifying the minimum number of lines. For instance if min_lines = 2, then only subsets of text with more than one line will be kept. |
trimmed_line |
either TRUE or FALSE. If FALSE then each line of the text file will be trimmed on both sides before applying the start_query and end_query |
to_lower |
either TRUE or FALSE. If TRUE the character string will be converted to lower case |
to_upper |
either TRUE or FALSE. If TRUE the character string will be converted to upper case |
utf_locale |
the language specific locale to use in case that either the to_lower or the to_upper parameter is TRUE and the text file language is other than english. For instance if the language of a text file is greek then the utf_locale parameter should be 'el_GR.UTF-8' ( language_country.encoding ). A wrong utf-locale does not raise an error, however the runtime of the function increases. |
max_num_char |
an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this function the Inf value translates to a word-length of 1000000000) |
remove_char |
a character string with specific characters that should be removed from the text file. If the remove_char is "" then no removal of characters takes place |
remove_punctuation_string |
either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function) |
remove_punctuation_vector |
either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place) |
remove_numbers |
either TRUE or FALSE. If TRUE then any numbers in the character string will be removed |
trim_token |
either TRUE or FALSE. If TRUE then the string will be trimmed (left and/or right) |
split_string |
either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters. |
split_separator |
a character string specifying the character delimiter(s) |
remove_stopwords |
either TRUE, FALSE or a character vector of user defined stop words. If TRUE then by using the language parameter the corresponding stop words vector will be uploaded. |
language |
a character string which defaults to english. If the remove_stopwords parameter is TRUE then the corresponding stop words vector will be uploaded. Available languages are afrikaans, arabic, armenian, basque, bengali, breton, bulgarian, catalan, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hausa, hebrew, hindi, hungarian, indonesian, irish, italian, latvian, marathi, norwegian, persian, polish, portuguese, romanian, russian, slovak, slovenian, somalia, spanish, swahili, swedish, turkish, yoruba, zulu |
min_num_char |
an integer specifying the minimum number of characters to keep. If min_num_char is greater than 1, then only character strings with at least that many characters will be returned |
stemmer |
a character string specifying the stemming method. Available method is the porter2_stemmer. See details for more information. |
min_n_gram |
an integer specifying the minimum number of n-grams. The minimum number of min_n_gram is 1. |
max_n_gram |
an integer specifying the maximum number of n-grams. The minimum number of max_n_gram is 1. |
skip_n_gram |
an integer specifying the number of skip-n-grams. The minimum number of skip_n_gram is 1. The skip_n_gram gives the (max.) n-grams using the skip_distance parameter. If skip_n_gram is greater than 1 then both min_n_gram and max_n_gram should be set to 1. |
skip_distance |
an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned. |
n_gram_delimiter |
a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases) |
threads |
an integer specifying the number of cores to run in parallel |
verbose |
either TRUE or FALSE. If TRUE then information will be printed in the console |
The text file should have a structure (such as an xml-structure), so that subsets can be extracted using the start_query and end_query parameters
For big files the vocabulary_accumulator method of the big_tokenize_transform class is appropriate
Stemming of the english language is done using the porter2-stemmer, for details see https://github.com/smassung/porter2_stemmer
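As an illustration of such a structure (the tag names below are hypothetical placeholders, not part of the package), each relevant line of the input file could be enclosed by the same start and end markers, which are then passed as start_query and end_query:

# hypothetical xml-like input, where every relevant line shares the same markers:
#
#   <doc> first subset of words </doc>
#   <doc> second subset of words </doc>
#
# with start_query = '<doc>' and end_query = '</doc>' only the words between
# the two markers are parsed and added to the vocabulary counts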
## Not run: 

library(textTinyR)

vps = vocabulary_parser(input_path_file = '/folder/input_data.txt',
                        start_query = 'start_word', end_query = 'end_word',
                        vocabulary_path_file = '/folder/vocab.txt',
                        to_lower = TRUE, split_string = TRUE)

## End(Not run)