---
title: "Functionality of the fastText R package"
author: "Lampros Mouselimis"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Functionality of the fastText R package}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
This vignette explains the functionality of the fastText R package. This R package is an interface to the [fasttext library](https://github.com/facebookresearch/fastText) for efficient learning of word representations and sentence classification. The following functions are included:
| fastText function | Description |
| :----------------------------------: | :---------------------------------------------------------------------------: |
| **fasttext_interface** | Interface for the fasttext library |
| **plot_progress_logs** | Plot the progress of the loss, learning-rate and word-counts |
| **printAnalogiesUsage** | Print usage information when the command equals 'analogies' |
| **printDumpUsage** | Print usage information when the command equals 'dump' |
| **printNNUsage** | Print usage information when the command equals 'nn' |
| **printPredictUsage** | Print usage information when the command equals 'predict' or 'predict-prob' |
| **printPrintNgramsUsage** | Print usage information when the command equals 'print-ngrams' |
| **printPrintSentenceVectorsUsage** | Print usage information when the command equals 'print-sentence-vectors' |
| **printPrintWordVectorsUsage** | Print usage information when the command equals 'print-word-vectors' |
| **printQuantizeUsage** | Print usage information when the command equals 'quantize' |
| **printTestLabelUsage** | Print usage information when the command equals 'test-label' |
| **printTestUsage** | Print usage information when the command equals 'test' |
| **printUsage** | Print usage information for all parameters |
| **print_parameters** | Print the parameters for a specific command |
I'll explain each function separately based on example data. More information can be found in the package documentation.
#### print_parameters
This function prints the default parameters of a specific 'command'. The 'command' can be, for instance, *supervised*, *skipgram*, *cbow* etc.,
```R
library(fastText)
print_parameters(command = "supervised")
Empty input or output path.
The following arguments are mandatory:
-input training file path
-output output file path
The following arguments are optional:
-verbose verbosity level [2]
The following arguments for the dictionary are optional:
-minCount minimal number of word occurences [1]
-minCountLabel minimal number of label occurences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [0]
-maxn max length of char ngram [0]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
The following arguments for training are optional:
-lr learning rate [0.1]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim size of word vectors [100]
-ws size of the context window [5]
-epoch number of epochs [5]
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax, one-vs-all} [softmax]
-thread number of threads [12]
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [false]
The following arguments for quantization are optional:
-cutoff number of words and ngrams to retain [0]
-retrain whether embeddings are finetuned if a cutoff is applied [false]
-qnorm whether the norm is quantized separately [false]
-qout whether the classifier is quantized [false]
-dsub size of each sub-vector [2]
Error in give_args_fasttext(args = c("fasttext", command)) :
EXIT_FAILURE -- args.cc file -- Args::parseArgs function
```
The error at the end of the output is expected: *print_parameters* displays the parameter list by invoking the fasttext argument parser without input and output paths, which therefore exits with a failure.
#### Print Usage Functions
Each of the functions whose name includes the words *print* and *Usage* prints usage information for the corresponding command. For instance,
```R
printPredictUsage()
usage: fasttext predict[-prob] <model> <test-data> [<k>] [<th>]

  <model>      model filename
  <test-data>  test data filename (if -, read from stdin)
  <k>          (optional; 1 by default) predict top k labels
  <th>         (optional; 0.0 by default) probability threshold
```
#### fasttext_interface
This function allows the user to run the various methods included in the [fasttext library](https://github.com/facebookresearch/fastText) from within R. The data that I'll use in the following code snippets can be downloaded as a .zip file (named **fastText_data**) from my [Github repository](https://github.com/mlampros/DataSets). The user should then unzip the file and make the extracted folder the working directory (using the base R function *setwd()*) before running the following code chunks.
```R
setwd('fastText_data') # make the extracted data the default directory
#------
# cbow
#------
library(fastText)
list_params = list(command = 'cbow',
lr = 0.1,
dim = 50,
input = "example_text.txt",
output = file.path(tempdir(), 'word_vectors'),
verbose = 2,
thread = 1)
res = fasttext_interface(list_params,
path_output = file.path(tempdir(), 'cbow_logs.txt'),
MilliSecs = 5,
remove_previous_file = TRUE,
print_process_time = TRUE)
Read 0M words
Number of words: 8
Number of labels: 0
Progress: 100.0% words/sec/thread: 2933 lr: 0.000000 loss: 4.060542 ETA: 0h 0m
time to complete : 3.542332 secs
```
**The data is saved in the specified *tempdir()* folder for illustration purposes. The user is advised to specify a folder of their own.**
```R
#-----------
# supervised
#-----------
list_params = list(command = 'supervised',
lr = 0.1,
dim = 50,
input = file.path("cooking.stackexchange", "cooking.train"),
output = file.path(tempdir(), 'model_cooking'),
verbose = 2,
thread = 4)
res = fasttext_interface(list_params,
path_output = file.path(tempdir(), 'sup_logs.txt'),
MilliSecs = 5,
remove_previous_file = TRUE,
print_process_time = TRUE)
Read 0M words
Number of words: 14543
Number of labels: 735
Progress: 100.0% words/sec/thread: 63282 lr: 0.000000 loss: 10.049338 ETA: 0h 0m
time to complete : 3.449003 secs
```
Here the user also has the option to plot the progress of the *loss*, *learning-rate* and *word-counts*,
```R
res = plot_progress_logs(path = file.path(tempdir(), 'sup_logs.txt'),
plot = TRUE)
dim(res)
```
![](progress_fasttext.png)
The number of lines in the logs file depends on the following parameters:
* *'MilliSecs'* (a higher value leads to fewer log lines in the file) and
* *'thread'* (a value greater than 1 leads to fewer log lines in the file).
The next command can be utilized to *'predict'* new data based on the output model,
```R
#-------------------
# 'predict' function
#-------------------
list_params = list(command = 'predict',
model = file.path(tempdir(), 'model_cooking.bin'),
test_data = file.path('cooking.stackexchange', 'cooking.valid'),
k = 1,
th = 0.0)
res = fasttext_interface(list_params,
path_output = file.path(tempdir(), 'predict_valid.txt'))
```
The output predictions will be of the form '__label__food-safety', where each line represents a predicted label (the number of lines of the output data must match the number of lines of the input data). With the *'predict-prob'* command one can obtain the probabilities of the labels as well,
```R
#------------------------
# 'predict-prob' function
#------------------------
list_params = list(command = 'predict-prob',
model = file.path(tempdir(), 'model_cooking.bin'),
test_data = file.path('cooking.stackexchange', 'cooking.valid'),
k = 1,
th = 0.0)
res = fasttext_interface(list_params,
path_output = file.path(tempdir(), 'predict_valid_prob.txt'))
```
Using *'predict-prob'* the output predictions will be of the form '__label__baking 0.0282927'.
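The saved predictions can then be parsed back into R with base R functions. The following is a minimal sketch that splits such label / probability pairs into a data.frame (the input vector is illustrative, not actual model output),

```R
# example lines as they might appear in the 'predict-prob' output file (k = 1);
# the probability of the second line is illustrative
pred_lines = c("__label__baking 0.0282927",
               "__label__food-safety 0.0262185")

splt = strsplit(pred_lines, " ")

preds = data.frame(label = sub("^__label__", "", sapply(splt, `[`, 1)),
                   prob  = as.numeric(sapply(splt, `[`, 2)),
                   stringsAsFactors = FALSE)
```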
Once the model is trained, one can evaluate it by computing the precision and recall at 'k' on a test set. The 'test' command simply prints the metrics in the R session,
```R
#----------------
# 'test' function
#----------------
list_params = list(command = 'test',
model = file.path(tempdir(), 'model_cooking.bin'),
test_data = file.path('cooking.stackexchange', 'cooking.valid'),
k = 1,
th = 0.0)
res = fasttext_interface(list_params)
N 3000
P@1 0.138
R@1 0.060
```
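The precision and recall printed above can be combined into a single F1 score (their harmonic mean), e.g.,

```R
# F1 score as the harmonic mean of precision and recall at k = 1
p_at_1 = 0.138
r_at_1 = 0.060

f1 = 2 * p_at_1 * r_at_1 / (p_at_1 + r_at_1)
round(f1, 4)    # 0.0836
```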
The 'test-label' command, in turn, allows the user to save,
```R
#----------------------
# 'test-label' function
#----------------------
list_params = list(command = 'test-label',
model = file.path(tempdir(), 'model_cooking.bin'),
test_data = file.path('cooking.stackexchange', 'cooking.valid'),
k = 1,
th = 0.0)
res = fasttext_interface(list_params,
path_output = file.path(tempdir(), 'test_label_valid.txt'))
```
the output to the *'test_label_valid.txt'* file, which includes the 'Precision' and 'Recall' for each unique label of the data set (*'cooking.stackexchange.txt'*). This means that the number of rows of *'test_label_valid.txt'* must equal the number of unique labels in the *'cooking.stackexchange.txt'* data set. This can be verified with the following code snippet,
```R
st_dat = read.delim(file.path("cooking.stackexchange", "cooking.stackexchange.txt"),
                    header = FALSE, stringsAsFactors = FALSE)

# keep, for each row, only the tokens that start with the '__label__' prefix
res_stackexch = unlist(lapply(1:nrow(st_dat), function(y) {
  words = strsplit(st_dat[y, 1], " ")[[1]]
  words[substr(words, 1, 9) == "__label__"]
}))
test_label_valid = read.table(file.path(tempdir(), 'test_label_valid.txt'),
quote="\"", comment.char="")
# number of unique labels of data equal to the rows of the 'test_label_valid.txt' file
length(unique(res_stackexch)) == nrow(test_label_valid)
[1] TRUE
head(test_label_valid)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 F1-Score : 0.234244 Precision : 0.139535 Recall : 0.729167 __label__baking
2 F1-Score : 0.227746 Precision : 0.132571 Recall : 0.807377 __label__food-safety
3 F1-Score : 0.058824 Precision : 0.750000 Recall : 0.030612 __label__substitutions
4 F1-Score : 0.000000 Precision : -------- Recall : 0.000000 __label__equipment
5 F1-Score : 0.017699 Precision : 1.000000 Recall : 0.008929 __label__bread
6 F1-Score : 0.000000 Precision : -------- Recall : 0.000000 __label__chicken
.....
```
The user can also *'quantize'* a supervised model to reduce its memory usage with the following command,
```R
#---------------------
# 'quantize' function
#---------------------
list_params = list(command = 'quantize',
input = file.path(tempdir(), 'model_cooking.bin'),
output = file.path(tempdir(), 'model_cooking'))
res = fasttext_interface(list_params)
print(list.files(tempdir(), pattern = '.ftz'))
[1] "model_cooking.ftz"
```
The quantize function is currently (as of 01/02/2019) [single-threaded](https://github.com/facebookresearch/fastText/issues/353#issuecomment-342501742).
Based on the *'queries.txt'* text file the user can save the word vectors to a file (one vector per line) using the following command,
```R
#----------------------------
# print-word-vectors function
#----------------------------
list_params = list(command = 'print-word-vectors',
model = file.path(tempdir(), 'model_cooking.bin'))
res = fasttext_interface(list_params,
path_input = 'queries.txt',
path_output = file.path(tempdir(), 'word_vecs_queries.txt'))
```
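Each saved line is assumed to consist of the query word followed by its 'dim' vector values; such a line can be parsed with base R as in the following sketch (the shortened sample line is illustrative),

```R
# a shortened sample line as it might appear in the saved word-vectors file
vec_line = "sauce 0.12 -0.05 0.33"

tokens = strsplit(vec_line, " ")[[1]]
word = tokens[1]                   # the query word
vec  = as.numeric(tokens[-1])      # its vector values
```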
To compute vector representations of sentences or paragraphs use the following command,
```R
#--------------------------------
# print-sentence-vectors function
#--------------------------------
library(fastText)
list_params = list(command = 'print-sentence-vectors',
model = file.path(tempdir(), 'model_cooking.bin'))
res = fasttext_interface(list_params,
path_input = 'text_sentence.txt',
path_output = file.path(tempdir(), 'word_sentence_queries.txt'))
```
Be aware that for *'print-sentence-vectors'* the input file (here *'text_sentence.txt'*) should consist of sentences of the following form,
```R
How much does potato starch affect a cheese sauce recipe
Dangerous pathogens capable of growing in acidic environments
How do I cover up the white spots on my cast iron stove
How do I cover up the white spots on my cast iron stove
Michelin Three Star Restaurant but if the chef is not there
......
```
Each line is treated as a sentence ending in **EOS (end of sentence)**. If a **newline** exists at the end of the file, the function will return an additional vector, so the user should **make sure** that the input file does **not** include empty lines at the end.
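One way to guard against such trailing empty lines is to clean the input file beforehand with base R; a minimal sketch (using a temporary demo file in place of the real input),

```R
# write a demo input file that ends with an empty line
tmp_in = tempfile(fileext = ".txt")
writeLines(c("How much does potato starch affect a cheese sauce recipe",
             "Dangerous pathogens capable of growing in acidic environments",
             ""), tmp_in)

# drop the empty lines at the end of the file and overwrite it
lns = readLines(tmp_in)
while (length(lns) > 0 && lns[length(lns)] == "") lns = lns[-length(lns)]
writeLines(lns, tmp_in)

length(readLines(tmp_in))    # 2
```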
The *'print-ngrams'* command prints the n-grams of a word in the R session or saves them to a file. First, however, the user should train and save a model with character n-grams enabled (*minn*, *maxn* parameters),
```R
#----------------------------------------
# 'skipgram' function with n-gram enabled
#----------------------------------------
list_params = list(command = 'skipgram',
lr = 0.1,
dim = 50,
input = "example_text.txt",
output = file.path(tempdir(), 'word_vectors'),
verbose = 2,
thread = 1,
minn = 2,
maxn = 2)
res = fasttext_interface(list_params,
path_output = file.path(tempdir(), 'skipgram_logs.txt'),
MilliSecs = 5)
#-----------------------
# 'print-ngram' function
#-----------------------
list_params = list(command = 'print-ngrams',
model = file.path(tempdir(), 'word_vectors.bin'),
word = 'word')
# save output to file
res = fasttext_interface(list_params,
path_output = file.path(tempdir(), 'ngrams.txt'))
# print output to console
res = fasttext_interface(list_params,
path_output = "")
# truncated output for the 'word' query
#--------------------------------------
0.00749 0.00720 0.01171 0.01258 ......
```
The *'nn'* command returns the nearest neighbors for a specific word based on the input model,
```R
#--------------
# 'nn' function
#--------------
list_params = list(command = 'nn',
model = file.path(tempdir(), 'model_cooking.bin'),
k = 5,
query_word = 'sauce')
res = fasttext_interface(list_params,
path_output = file.path(tempdir(), 'nearest.txt'))
# 'nearest.txt'
#--------------
rice 0.804595
</s> 0.799858
Vide 0.78893
store 0.788918
cheese 0.785977
```
The *'analogies'* command takes as input a file of word triplets (one whitespace-separated triplet per line) and returns 'k' rows for each triplet, with the result blocks separated by an empty line,
```R
#---------------------
# 'analogies' function
#---------------------
list_params = list(command = 'analogies',
model = file.path(tempdir(), 'model_cooking.bin'),
k = 5)
res = fasttext_interface(list_params,
path_input = 'analogy_queries.txt',
path_output = file.path(tempdir(), 'analogies_output.txt'))
# 'analogies_output.txt'
#-----------------------
batter 0.857213
I 0.854491
recipe? 0.851498
substituted 0.845269
flour 0.842508
covered 0.808651
calls 0.801348
fresh 0.800051
cold 0.797468
always 0.793695
.............
```
Finally, the *'dump'* command takes as *'option'* one of *'args'*, *'dict'*, *'input'* or *'output'* and dumps the corresponding data to a text file,
```R
#--------------
# dump function
#--------------
list_params = list(command = 'dump',
model = file.path(tempdir(), 'model_cooking.bin'),
option = 'args')
res = fasttext_interface(list_params,
path_output = file.path(tempdir(), 'dump_data.txt'),
remove_previous_file = TRUE)
# 'dump_data.txt'
#----------------
dim 50
ws 5
epoch 5
minCount 1
neg 5
wordNgrams 1
loss softmax
model sup
bucket 0
minn 0
maxn 0
lrUpdateRate 100
t 0.00010
```
```
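The dumped 'args' output can be read back into R as a named list; the following is a minimal base-R sketch (using an inline sample of the file's "name value" lines),

```R
# sample lines in the "name value" format of the dumped 'args' file
dump_lines = c("dim 50", "ws 5", "epoch 5", "loss softmax")

splt = strsplit(dump_lines, " ")
args_list = setNames(lapply(splt, `[`, 2), sapply(splt, `[`, 1))

args_list$dim     # "50"
args_list$loss    # "softmax"
```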