Welcome to 16892 Developer Community-Open, Learning,Share
Welcome To Ask or Share your Answers For Others

Working with R, I'm looking for ways to weight case (i.e., upper vs lower case) in a stringdist_left_join()

Here's a reproducible example:

library(tidyverse)
library(fuzzyjoin)

tibble1 <- tibble(words = c("Bedford", "Maidenhead", "New Forest", "Tier 3", "Citizenship", "Crown"))

tibble2 <- tibble(words = c("bedfords", "bedsford", "BEDFord", "Maidenshead", "Maidenhed", "News forest", "Tier 3", "Citisenships", "crowned", "crows"))

osa <- stringdist_left_join(tibble1, tibble2, distance_col = "distance", max_dist = 5, method = "osa", weight = c(d = 0.1, i = 0.1, s = 1, t = 1))

Above is the code to reproduce a fuzzyjoin-powered stringdist_left_join on a couple of tibbles. The output looks like this:

# A tibble: 55 x 3
   words.x words.y      distance
   <chr>   <chr>           <dbl>
 1 Bedford bedfords        0.3  
 2 Bedford bedsford        0.3  
 3 Bedford BEDFord         0.6  
 4 Bedford Maidenshead     1.4  
 5 Bedford Maidenhed       1.2  
 6 Bedford News forest     1.00 
 7 Bedford Tier 3          0.900
 8 Bedford Citisenships    1.7  
 9 Bedford crowned         1.00 
10 Bedford crows           1.00 
# … with 45 more rows

What I'd like is some way to weight the capitalisation. For example, comparing Bedford to BEDFord: I'd like that to be a worse match than Bedford to Bedford, but a better match than Bedford to Bedsford. The option ignore_case = TRUE treats BEDFord as a perfect match with Bedford.

I'm liking the fuzzyjoin package, and I just discovered the custom weightings that you can pass to stringdist for each of deletion, insertion, substitution, and transposition. Which is fantastic: toys to play with, parameters to tune.

What I'd also like to be able to do is tune the case (capitalisation?) matching. I've got the option to set ignore_case = TRUE in stringdist_left_join (in effect, weighting case as 0 or 1), but being the annoying cur that I am, I'd like to play around with weightings between 0 and 1.

Does anyone know if there's an option somewhere that I'm missing?

Or is the answer: do it the hard way? I guess there might be a long way round, either comparing the distances before and after running tolower(), or computing a weighted distance that combines ignore_case = TRUE with ignore_case = FALSE. But does anyone know of a more elegant method or package that I can use to do that?

Thanks


Fight dragons too long and you become one yourself; gaze too long into the abyss and the abyss gazes back…

1 Answer

You could run it twice, once with ignore_case = TRUE and once with ignore_case = FALSE, and then find an appropriate linear combination of the two distances.

Something like lambda * (distance_FALSE - distance_TRUE) + distance_TRUE, where lambda (between 0 and 1) is how much weight you give capitalisation differences relative to other string differences: 0 ignores case entirely, 1 treats a case difference like any other difference.
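
A rough sketch of that idea, reusing the tibbles and edit weights from the question. The column names distance_FALSE, distance_TRUE and distance_weighted, and the value lambda = 0.1, are illustrative choices rather than anything fuzzyjoin prescribes, so treat this as a starting point under those assumptions:

```r
library(tidyverse)
library(fuzzyjoin)

tibble1 <- tibble(words = c("Bedford", "Maidenhead", "New Forest", "Tier 3", "Citizenship", "Crown"))
tibble2 <- tibble(words = c("bedfords", "bedsford", "BEDFord", "Maidenshead", "Maidenhed",
                            "News forest", "Tier 3", "Citisenships", "crowned", "crows"))

# Same edit weights as in the question
w <- c(d = 0.1, i = 0.1, s = 1, t = 1)

# Assumed value for illustration: 0 ignores case, 1 gives it full weight
lambda <- 0.1

# Run 1: case-sensitive distances
strict <- stringdist_left_join(tibble1, tibble2, by = "words", max_dist = 5,
                               method = "osa", weight = w,
                               ignore_case = FALSE, distance_col = "distance_FALSE")

# Run 2: case-insensitive distances
loose <- stringdist_left_join(tibble1, tibble2, by = "words", max_dist = 5,
                              method = "osa", weight = w,
                              ignore_case = TRUE, distance_col = "distance_TRUE")

# Keep the pairs found by both runs and blend the two distances.
# Pairs that only pass max_dist when case is ignored are dropped here.
combined <- inner_join(strict, loose, by = c("words.x", "words.y")) %>%
  mutate(distance_weighted = lambda * (distance_FALSE - distance_TRUE) + distance_TRUE) %>%
  arrange(words.x, distance_weighted)

print(combined)
```

The inner_join assumes each table has no duplicate words, as in the example. If you want max_dist to apply to the blended distance instead, one option would be to run both joins with a generous max_dist and filter on distance_weighted afterwards.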
