
Fuzzy compare
Last Post 07 Apr 2009 10:05 AM by Support Team. 4 Replies.
Tom

--
01 Apr 2009 08:11 AM
Hi

I've been trialling the Enterprise edition, and the Fuzzy compare feature in particular, and am not quite getting the results I anticipated. The following example is applicable to all three fuzzy algorithm options that DJ uses.

Say I'm comparing a book title entry in a data grid cell with a variable value, "The Magic Pudding", and doing a fuzzy compare to another value, "Magic Pudding, The". I get a greater mismatch value for this entry than I would for the value "A Magic Pudding". Clearly the last value is an obvious error, whereas the first value should be considered a close match.

What is the best way to get the match for the first example and a mismatch for the latter? Would I require a replacement list entry for these types of scenarios?

Thanks for any reply.
Support Team

--
01 Apr 2009 10:25 AM
Tom, good question, thanks for your post. Fuzzy matching is a feature that needs explanation because it is quite technical and often needs tweaking to get the proper results. You need a good understanding of the algorithms.

The general approach to get the best matching results is twofold: first you seek the algorithm that gives the best results for a given data set (e.g. matching 80% of the data correctly). Next, from the mismatches you create a replacement list (see the actions) that, prior to matching, converts mismatches into something that will yield a correct match (a sort of exception list). In other words, the (only) way to correct mismatches is to replace the whole mismatching string, or parts of it, before matching.

Now, for the algorithms:

The fuzzy match Levenshtein algorithm gives the following results (lower is better):
"The Magic Pudding" vs "Magic Pudding, The": 9
"The Magic Pudding" vs "A Magic Pudding": 3 (best result!?)

The Levenshtein algorithm looks at "edit distance" in a string: how many edit operations would I need to convert string A into string B? The most important thing to realize is that the algorithm doesn't look at /sentences/ (as humans automatically do), but at a /string/. Space, to it, is just another character. This is the main reason your examples don't give good results.
So, how does it work? To go from "Magic Pudding, The" to "The Magic Pudding", we first do 4 insertions at the front ("The "), which yields "The Magic Pudding, The". Next we do 5 deletions at the end (", The"), and we get "The Magic Pudding". This gives a combined edit distance of 9; the distance is the same in the other direction. If we go from "A Magic Pudding" we need 1 replacement (A->T) and 2 insertions (he), giving an edit distance of 3. That is why the latter scores better. As you see, the algorithm disregards word boundaries.
Tip: search for Levenshtein in Wikipedia and read the explanation.
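
To make the counting above concrete, here is a minimal Python sketch of the textbook Levenshtein distance. It is not DJ's own code, only the standard dynamic-programming formulation in which insertions, deletions and replacements each cost 1; the function name is chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance: insert, delete and replace each cost 1."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("The Magic Pudding", "Magic Pudding, The"))  # 9
print(levenshtein("The Magic Pudding", "A Magic Pudding"))     # 3
```

Spaces and commas are treated like any other character here, which is why the reordered title ends up further away than the wrong one.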

The fuzzy match Jaro-Winkler algorithm gives the following results (higher is better):
"The Magic Pudding" vs "Magic Pudding, The": 0.70
"The Magic Pudding" vs "A Magic Pudding": 0.87 (best result!?)

The Jaro-Winkler algorithm looks at similarities (where Levenshtein looks at differences). Like Levenshtein, Jaro-Winkler doesn't look at sentences. "A Magic Pudding" scores better here, because it doesn't need as many transpositions (reshuffling letters instead of inserting) as "Magic Pudding, The".
Tip: search for Jaro-Winkler in Wikipedia and read the explanation.
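
For readers who want to experiment, below is a rough Python sketch of the commonly published Jaro-Winkler formula. It is an assumption that DJ follows the same variant; details such as the match window, case handling and prefix weight differ between implementations, so the absolute scores may not equal the figures quoted above. The function names and the prefix_scale parameter are illustrative.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters only count as matching if they are at most `window` apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len(s2), i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


reordered = jaro_winkler("The Magic Pudding", "Magic Pudding, The")
wrong = jaro_winkler("The Magic Pudding", "A Magic Pudding")
print(reordered < wrong)  # True
```

Even in this generic variant the wrong title outscores the reordered one, because the reordered title's matched characters line up out of sequence and are penalised as transpositions.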

The fuzzy match DA1 algorithm gives the following results (lower is better):
"The Magic Pudding" vs "Magic Pudding, The": -170
"The Magic Pudding" vs "A Magic Pudding": -390 (best result!?)
DA1 does look at sentences, which offers an improvement over the other two algorithms. It does an edit distance for each word separately. Unfortunately, it is also sensitive to /word order/, in the sense that words that occur /at the same place/ in the sentence get rewarded. It also rewards matching words at the beginning of the sentence more than words at the end. That is why the match goes wrong. Why? We have custom designed this algorithm to do product matches ("Fiat turbo 3 doors green" is better matched by "Fiat turbo 5 doors red" than by "Saab 3 doors turbo green", even though the latter matches 4 instead of 3 words with the first Fiat).

What you would seem to need is an algorithm that, like DA1, looks at sentences instead of strings. But that algorithm should not be sensitive to word order as DA1 is. The development team does plan to incorporate more fuzzy algorithms, and the aforementioned one is top of the list for a future release.
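
As an illustration only (this is neither DA1 nor the planned algorithm), one simple way to make a comparison insensitive to word order is to lower-case each title, drop punctuation, sort the words, and only then compare. The sketch below uses Python's standard difflib for the final comparison; the function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def token_sort_key(title: str) -> str:
    """Lower-case the title, strip punctuation and sort its words."""
    words = re.findall(r"\w+", title.lower())
    return " ".join(sorted(words))


def order_insensitive_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] that ignores word order and punctuation."""
    return SequenceMatcher(None, token_sort_key(a), token_sort_key(b)).ratio()


print(token_sort_key("Magic Pudding, The"))  # magic pudding the
print(order_insensitive_similarity("The Magic Pudding", "Magic Pudding, The"))  # 1.0
print(order_insensitive_similarity("The Magic Pudding", "A Magic Pudding") < 1.0)  # True
```

The trade-off is that word order is discarded entirely, so two titles made of the same words in a different order would also score as identical.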

Whether you can accomplish your task using the present algorithms is hard to say; it all depends on the list of strings you want to match.

Kind regards,
Support Team
Tom

--
01 Apr 2009 11:34 PM
Thanks for the reply and detailed explanation, Support Team. It would be very handy if you could implement another algorithm type to accommodate phrases such as the one in my example. Working in a library, it's quite common to have to deal with numerous data entry variations such as these.

The way I've now tried to approach this problem is to use a regular expression with the 'Match Text' action to capture all instances where there would be an ending article in a variable value (e.g. ', The', ', A'), delete it from the end and place it at the beginning of the sentence. The problem is, I don't know how to achieve this. Capturing the ending articles is no problem, but from there on I don't know what to do. I've tried the Subtract Variable action to delete the capture, but I get an error message saying that it's the wrong data type. I believe that this action works with both text and number variables, as 'Help' states, so I must be doing something wrong. Would you be so kind as to assist me in this?

Thanks
Tom
Tom

--
02 Apr 2009 12:53 AM
....It's OK, I've solved the problem....after a morning coffee! - Using the 'Match and Replace Text' action. I would still like to know how the Add and Subtract variable actions work on text.
Support Team

--
07 Apr 2009 10:05 AM
Yes, it is a good idea to use a regex. Glad it works!
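
For other readers, the idea of moving a trailing article to the front can be sketched in Python as follows. The pattern and function name are illustrative assumptions, and the regex syntax accepted by the 'Match and Replace Text' action may differ from Python's.

```python
import re

# Title text, a comma, then a trailing article ("The", "An" or "A").
TRAILING_ARTICLE = re.compile(r"^(.*),\s*(The|An|A)\s*$", re.IGNORECASE)


def move_trailing_article(title: str) -> str:
    """Turn 'Magic Pudding, The' into 'The Magic Pudding'."""
    match = TRAILING_ARTICLE.match(title)
    if match is None:
        return title  # no trailing article, leave the title unchanged
    return f"{match.group(2)} {match.group(1)}"


print(move_trailing_article("Magic Pudding, The"))  # The Magic Pudding
print(move_trailing_article("Magic Pudding, A "))   # A Magic Pudding
print(move_trailing_article("The Magic Pudding"))   # The Magic Pudding
```

After this normalisation, "Magic Pudding, The" is identical to "The Magic Pudding", so any of the fuzzy algorithms will report a perfect match.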

The Subtract action doesn't work for text (Add does), only for datetime and number. It gives an error if you try to use text. The Add to Variable action works for all types. The text type was allowed (incorrectly) in this action however - we have changed this for the next release.

