Oracle Fuzzy Text Matching: Difference between revisions
m 1 revision |
mNo edit summary |
||
| Line 3: | Line 3: | ||
== Oracle Text == | == Oracle Text == | ||
To be able to "Googele" your database Oracle developed the "Oracle Text" product (Oracle 10g). Configuration however is complicated, with more functionalities then you will ever need. That's why some quick and dirty alternatives on this page. | |||
More info: | |||
* "Oracle Text":http://www.oracle.com/technetwork/database/enterprise-edition/index-098492.html | * "Oracle Text":http://www.oracle.com/technetwork/database/enterprise-edition/index-098492.html | ||
== Like == | == Like == | ||
With _LIKE_ you can use wildcards to match text fields. '?' is used to match any single character and '%' is used to match none or more arbitrary characters: | |||
<syntaxhighlight lang=sql> | <syntaxhighlight lang=sql> | ||
| Line 28: | Line 28: | ||
</pre> | </pre> | ||
Define an escape character if you want to match a wildcard character ('%' or '?'): | |||
<syntaxhighlight lang=sql> | <syntaxhighlight lang=sql> | ||
| Line 46: | Line 46: | ||
== Soundex == | == Soundex == | ||
The Soundex function translates a string to a numer, representing the 'sound' of the text. You can compare texts that do not _exactly_ match: | |||
<syntaxhighlight lang=sql> | <syntaxhighlight lang=sql> | ||
| Line 57: | Line 57: | ||
== Package UTL_MATCH == | == Package UTL_MATCH == | ||
Oracle 10g | Oracle 10g contains the package UTL_MATCH with a number of functies to find matches. The following paragraphs will show some possibilities by example: | ||
Used functions: | |||
* TRIM | * TRIM | ||
** spaties en tabs rondom de tekst verwijderen | ** spaties en tabs rondom de tekst verwijderen | ||
| Line 90: | Line 90: | ||
</syntaxhighlight> | </syntaxhighlight> | ||
Selecting the 'best' Winkler-match, with a minumum of 95% to limit the results: | |||
<syntaxhighlight lang=sql> | <syntaxhighlight lang=sql> | ||
Revision as of 10:03, 28 July 2018
Oracle Text
To be able to "Googele" your database Oracle developed the "Oracle Text" product (Oracle 10g). Configuration however is complicated, with more functionalities then you will ever need. That's why some quick and dirty alternatives on this page.
More info:
Like
With _LIKE_ you can use wildcards to match text fields. '?' is used to match any single character and '%' is used to match none or more arbitrary characters:
select kolom from tabel where kolom like 'een?waarde%' ;
kolom ------ een waarde een_waarde 123 een_waarde 12% een%waarde ABC
Define an escape character if you want to match a wildcard character ('%' or '?'):
select kolom from tabel where kolom like '%\%%' escape '\' ;
kolom ------ een_waarde 12% een%waarde ABC
Soundex
The Soundex function translates a string to a numer, representing the 'sound' of the text. You can compare texts that do not _exactly_ match:
select * from tabel where soundex(kolom1) = soundex(kolom2) ;
Package UTL_MATCH
Oracle 10g contains the package UTL_MATCH with a number of functies to find matches. The following paragraphs will show some possibilities by example:
Used functions:
- TRIM
- spaties en tabs rondom de tekst verwijderen
- UPPER
- alles hoofdletters maken
- CONVERT (kolom1,'us7ascii')
- accenten vervangen, zoals in 'ëïñdhóvèn' -> 'eindhoven'
- UTL_MATCH.JARO_WINKLER_SIMILARITY(kolom1,kolom2)
- jaro-winkler matching %, ("Wikipedia":http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)
- UTL_MATCH.EDIT_DISTANCE_SIMILARITY(kolom1,kolom2)
- levenshtein-distance matching %, ("Wikipedia":http://en.wikipedia.org/wiki/Levenshtein_distance)
select kolom1, kolom2
, convert(Upper(Trim(kolom1)),'us7ascii')
, convert(Upper(Trim(kolom2)),'us7ascii')
, UTL_MATCH.JARO_WINKLER_SIMILARITY(convert(Upper(Trim(kolom1)),'us7ascii') , convert(Upper(Trim(kolom2)),'us7ascii')) as Winkler
, UTL_MATCH.EDIT_DISTANCE_SIMILARITY(convert(Upper(Trim(kolom1)),'us7ascii'),convert(Upper(Trim(kolom2)),'us7ascii')) as Distance
from (
select 'Eindhoven' as kolom1, 'Veldhoven' as kolom2 from dual
union
select 'Eindhoven', 'ëïñdhóvèn' from dual
union
select 'Eindhoven ', 'EiNdHovEn' from dual
union
select 'Eindhoven', 'Helmond' from dual
)
order by Winkler desc
;
Selecting the 'best' Winkler-match, with a minumum of 95% to limit the results:
select * from (
select kolom1, kolom2
, UTL_MATCH.JARO_WINKLER_SIMILARITY(
convert(Upper(Trim(kolom1)),'us7ascii')
, convert(Upper(Trim(kolom2)),'us7ascii')) as Winkler
from (
select 'Eindhoven' as kolom1, 'Veldhoven' as kolom2 from dual
union
select 'Eindhoven', 'ëïñdhóvèn' from dual
union
select 'Eindhoven ', 'EiNdHovEn' from dual
union
select 'Eindhoven', 'Helmond' from dual
)
order by Winkler desc
) where rownum<=1 -- TOP N rows
and Winkler>=95 -- minimum match %
;