Tom, good question, and thanks for your post.

Fuzzy matching needs some explanation because it is quite technical and often needs tweaking to get proper results, so it helps to understand the algorithms. The general approach to getting the best matching results is twofold. First, you look for the algorithm that gives the best results for a given data set (e.g. one that matches 80% of the data correctly). Next, from the mismatches you create a replacement list (see the actions) that, prior to matching, converts each mismatch into something that will yield a correct match (a sort of exception list). In other words, the (only) way to correct a mismatch is to replace the whole string, or parts of it, before matching.

Now, for the algorithms.

The fuzzy match Levenshtein algorithm gives the following results (lower is better):

"The Magic Pudding" vs "Magic Pudding, The": 9
"The Magic Pudding" vs "A Magic Pudding": 3 (best result!?)

The Levenshtein algorithm looks at the "edit distance" between strings: how many edit operations (inserting, deleting or replacing a single character) would I need to convert string A into string B? The most important thing to realize is that the algorithm doesn't look at /sentences/ (as humans automatically do), but at a /string/. A space, to it, is just another character. This is the main reason your examples don't give good results.

So, how does it work? To go from "Magic Pudding, The" to "The Magic Pudding", we first do 4 insertions at the start ("The " including the space), which yields "The Magic Pudding, The". Next we do 5 deletions at the end (", The"), and we get the original string: "The Magic Pudding". This gives a combined edit distance of 9. If we go from "A Magic Pudding", we need only 1 replacement (A->T) and 2 insertions (he), giving an edit distance of 3. That is why the latter scores better. As you see, the algorithm disregards word boundaries.

Tip: search for Levenshtein in Wikipedia and read the explanation.
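
If you want to check these numbers yourself, here is a minimal Python sketch of the textbook Levenshtein calculation. It is only an illustration, not the product's own code: the function names and the one-entry replacement list are made up for this example, and the replacement step simply assumes a whole-string substitution before matching.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and
    replacements needed to turn string a into string b."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]


# Hypothetical replacement list: applied to a string before matching.
REPLACEMENTS = {"Magic Pudding, The": "The Magic Pudding"}


def normalize(text: str) -> str:
    """Return the replacement for text if one is listed, else text itself."""
    return REPLACEMENTS.get(text, text)


title = "The Magic Pudding"
print(levenshtein(title, "Magic Pudding, The"))             # 9
print(levenshtein(title, "A Magic Pudding"))                # 3
print(levenshtein(title, normalize("Magic Pudding, The")))  # 0
```

The last line shows the idea behind the replacement list: once the known mismatch is rewritten before matching, the distance drops to 0 and the intended candidate wins.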

The fuzzy match Jaro-Winkler algorithm gives the following results (higher is better):

"The Magic Pudding" vs "Magic Pudding, The": 0.70
"The Magic Pudding" vs "A Magic Pudding": 0.87 (best result!?)

The Jaro-Winkler algorithm looks at similarities (where Levenshtein looks at differences). Like Levenshtein, Jaro-Winkler doesn't look at sentences. "A Magic Pudding" scores better here because it doesn't need as many transpositions (reshuffling letters instead of inserting) as "Magic Pudding, The".

Tip: search for Jaro-Winkler in Wikipedia and read the explanation.

The fuzzy match DA1 algorithm gives the following results (lower is better):

"The Magic Pudding" vs "Magic Pudding, The": -170
"The Magic Pudding" vs "A Magic Pudding": -390 (best result!?)

DA1 does look at sentences, which is an improvement over the other two algorithms. It calculates an edit distance for each word separately. Unfortunately, it is also sensitive to /word order/, in the sense that words that occur /at the same place/ in the sentence get rewarded. It also rewards matching words at the beginning of the sentence more than words at the end. That is why the match goes wrong here. Why does it work this way? We custom designed this algorithm to do product matches ("Fiat turbo 3 doors green" is better matched by "Fiat turbo 5 doors red" than by "Saab 3 doors turbo green", even though the latter matches 4 instead of 3 words with the first Fiat).

What you seem to need is an algorithm that, like DA1, looks at sentences instead of strings, but that is not sensitive to word order as DA1 is. The development team does plan to incorporate more fuzzy algorithms, and the aforementioned one is top of the list for a future release.

Whether you can accomplish your task using the present algorithms is hard to say; it all depends on the list of strings you want to match.

Kind regards,
Support Team