--- title: "Functionality of the fuzzywuzzyR package" author: "Lampros Mouselimis" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Functionality of the fuzzywuzzyR package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- I recently released an (other one) R package on CRAN - **fuzzywuzzyR** - which ports the [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) python library in R. "fuzzywuzzy does fuzzy string matching by using the Levenshtein Distance to calculate the differences between sequences (of character strings)."

The *fuzzywuzzyR* package includes R6-classes / functions for string matching,
##### **classes**
| FuzzExtract | FuzzMatcher | FuzzUtils | SequenceMatcher | | :-------------------------: | :----------------------------: | :-----------------------------: | :---------------------: | | Extract() | Partial_token_set_ratio() | Full_process() | ratio() | | ExtractBests() | Partial_token_sort_ratio() | Make_type_consistent() | quick_ratio() | | ExtractWithoutOrder() | Ratio() | Asciidammit() | real_quick_ratio() | | ExtractOne() | QRATIO() | Asciionly() | get_matching_blocks() | | | WRATIO() | Validate_string() | get_opcodes() | | | UWRATIO() | | | | | UQRATIO() | | | | | Token_sort_ratio() | | | | | Partial_ratio() | | | | | Token_set_ratio() | | |
##### **functions**
| GetCloseMatches() | | :---------------- |
The following code chunks / examples are part of the package documentation and give an idea on what can be done with the *fuzzywuzzyR* package,
##### *FuzzExtract*
Each one of the methods in the *FuzzExtract* class takes a *character string* and a *character string sequence* as input ( except for the *Dedupe* method which takes a string sequence only ) and given a *processor* and a *scorer* it returns one or more string match(es) and the corresponding score ( in the range 0 - 100 ). Information about the additional parameters (*limit*, *score_cutoff* and *threshold*) can be found in the package documentation,
```{r, eval = F, echo = T} library(fuzzywuzzyR) word = "new york jets" choices = c("Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys") #------------ # processor : #------------ init_proc = FuzzUtils$new() # initialization of FuzzUtils class to choose a processor PROC = init_proc$Full_process # processor-method PROC1 = tolower # base R function ( as an example for a processor ) #--------- # scorer : #--------- init_scor = FuzzMatcher$new() # initialization of the scorer class SCOR = init_scor$WRATIO # choosen scorer function init <- FuzzExtract$new() # Initialization of the FuzzExtract class init$Extract(string = word, sequence_strings = choices, processor = PROC, scorer = SCOR) ``` ```{r, eval = F, echo = T} # example output [[1]] [[1]][[1]] [1] "New York Jets" [[1]][[2]] [1] 100 [[2]] [[2]][[1]] [1] "New York Giants" [[2]][[2]] [1] 79 [[3]] [[3]][[1]] [1] "Atlanta Falcons" [[3]][[2]] [1] 29 [[4]] [[4]][[1]] [1] "Dallas Cowboys" [[4]][[2]] [1] 22 ``` ```{r, eval = F, echo = T} # extracts best matches (limited to 2 matches) init$ExtractBests(string = word, sequence_strings = choices, processor = PROC1, scorer = SCOR, score_cutoff = 0L, limit = 2L) ``` ```{r, eval = F, echo = T} [[1]] [[1]][[1]] [1] "New York Jets" [[1]][[2]] [1] 100 [[2]] [[2]][[1]] [1] "New York Giants" [[2]][[2]] [1] 79 ``` ```{r, eval = F, echo = T} # extracts matches without keeping the output order init$ExtractWithoutOrder(string = word, sequence_strings = choices, processor = PROC, scorer = SCOR, score_cutoff = 0L) ``` ```{r, eval = F, echo = T} [[1]] [[1]][[1]] [1] "Atlanta Falcons" [[1]][[2]] [1] 29 [[2]] [[2]][[1]] [1] "New York Jets" [[2]][[2]] [1] 100 [[3]] [[3]][[1]] [1] "New York Giants" [[3]][[2]] [1] 79 [[4]] [[4]][[1]] [1] "Dallas Cowboys" [[4]][[2]] [1] 22 ``` ```{r, eval = F, echo = T} # extracts first result init$ExtractOne(string = word, sequence_strings = choices, processor = PROC, scorer = SCOR, score_cutoff = 0L) ``` ```{r, eval = F, echo = T} [[1]] [1] "New York Jets" [[2]] [1] 100 ```
The *dedupe* method removes duplicates from a sequence of character strings using fuzzy string matching,
```{r, eval = F, echo = T} duplicat = c('Frodo Baggins', 'Tom Sawyer', 'Bilbo Baggin', 'Samuel L. Jackson', 'F. Baggins', 'Frody Baggins', 'Bilbo Baggins') init$Dedupe(contains_dupes = duplicat, threshold = 70L, scorer = SCOR) ``` ```{r, eval = F, echo = T} [1] "Frodo Baggins" "Samuel L. Jackson" "Bilbo Baggins" "Tom Sawyer" ```
##### *FuzzMatcher*
Each one of the methods in the *FuzzMatcher* class takes two *character strings* (string1, string2) as input and returns a score ( in range 0 to 100 ). Information about the additional parameters (*force_ascii*, *full_process* and *threshold*) can be found in the package documentation, ```{r, eval = F, echo = T} s1 = "Atlanta Falcons" s2 = "New York Jets" init = FuzzMatcher$new() initialization of FuzzMatcher class init$Partial_token_set_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE) # example output [1] 31 ``` ```{r, eval = F, echo = T} init$Partial_token_sort_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE) [1] 31 ``` ```{r, eval = F, echo = T} init$Ratio(string1 = s1, string2 = s2) [1] 21 ``` ```{r, eval = F, echo = T} init$QRATIO(string1 = s1, string2 = s2, force_ascii = TRUE) [1] 29 ``` ```{r, eval = F, echo = T} init$WRATIO(string1 = s1, string2 = s2, force_ascii = TRUE) [1] 29 ``` ```{r, eval = F, echo = T} init$UWRATIO(string1 = s1, string2 = s2) [1] 29 ``` ```{r, eval = F, echo = T} init$UQRATIO(string1 = s1, string2 = s2) [1] 29 ``` ```{r, eval = F, echo = T} init$Token_sort_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE) [1] 29 ``` ```{r, eval = F, echo = T} init$Partial_ratio(string1 = s1, string2 = s2) [1] 23 ``` ```{r, eval = F, echo = T} init$Token_set_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE) [1] 29 ```
##### *FuzzUtils*
The *FuzzUtils* class includes a number of utility methods, from which the *Full_process* method is from greater importance as besides its main functionality it can also be used as a secondary function in some of the other fuzzy matching classes,
```{r, eval = F, echo = T} s1 = 'Frodo Baggins' init = FuzzUtils$new() init$Full_process(string = s1, force_ascii = TRUE) ``` ```{r, eval = F, echo = T} # example output [1] "frodo baggins" ```
##### *GetCloseMatches*
The *GetCloseMatches* method returns a list of the best "good enough" matches. The parameter *string* is a sequence for which close matches are desired (typically a character string), and *sequence_strings* is a list of sequences against which to match the parameter *string* (typically a list of strings).
```{r, eval = F, echo = T} vec = c('Frodo Baggins', 'Tom Sawyer', 'Bilbo Baggin') str1 = 'Fra Bagg' GetCloseMatches(string = str1, sequence_strings = vec, n = 2L, cutoff = 0.6) ``` ```{r, eval = F, echo = T} [1] "Frodo Baggins" ```
##### *SequenceMatcher*
The *SequenceMatcher* class is based on [difflib](https://www.npmjs.com/package/difflib) which comes by default installed with python and includes the following fuzzy string matching methods,
```{r, eval = F, echo = T} s1 = ' It was a dark and stormy night. I was all alone sitting on a red chair.' s2 = ' It was a murky and stormy night. I was all alone sitting on a crimson chair.' init = SequenceMatcher$new(string1 = s1, string2 = s2) init$ratio() [1] 0.9127517 ``` ```{r, eval = F, echo = T} init$quick_ratio() [1] 0.9127517 ``` ```{r, eval = F, echo = T} init$real_quick_ratio() [1] 0.966443 ```
The *get_matching_blocks* and *get_opcodes* return triples and 5-tuples describing matching subsequences. More information can be found in the [Python's difflib module](https://www.npmjs.com/package/difflib) and in the *fuzzywuzzyR* package documentation.
A last think to note here is that the mentioned fuzzy string matching classes can be parallelized using the base R *parallel* package. For instance, the following *MCLAPPLY_RATIOS* function can take two vectors of character strings (QUERY1, QUERY2) and return the scores for each method of the *FuzzMatcher* class,
```{r, eval = F, echo = T} MCLAPPLY_RATIOS = function(QUERY1, QUERY2, class_fuzz = 'FuzzMatcher', method_fuzz = 'QRATIO', threads = 1, ...) { init <- eval(parse(text = paste0(class_fuzz, '$new()'))) METHOD = paste0('init$', method_fuzz) if (threads == 1) { res_qrat = lapply(1:length(QUERY1), function(x) do.call(eval(parse(text = METHOD)), list(QUERY1[[x]], QUERY2[[x]], ...)))} else { res_qrat = parallel::mclapply(1:length(QUERY1), function(x) do.call(eval(parse(text = METHOD)), list(QUERY1[[x]], QUERY2[[x]], ...)), mc.cores = threads) } return(res_qrat) } ```
```{r, eval = F, echo = T} query1 = c('word1', 'word2', 'word3') query2 = c('similarword1', 'similar_word2', 'similarwor') quer_res = MCLAPPLY_RATIOS(query1, query2, class_fuzz = 'FuzzMatcher', method_fuzz = 'QRATIO', threads = 1) unlist(quer_res) ``` ```{r, eval = F, echo = T} # example output [1] 59 56 40 ```