Oracle Fuzzy Text Matching
Oracle Text
To be able to "Googele" your database Oracle developed the "Oracle Text" product (Oracle 10g). Configuration however is complicated, with more functionalities then you will ever need. That's why some quick and dirty alternatives on this page.
More info:
Like
With _LIKE_ you can use wildcards to match text fields. '?' is used to match any single character and '%' is used to match none or more arbitrary characters:
select kolom from tabel where kolom like 'een?waarde%' ;
kolom ------ een waarde een_waarde 123 een_waarde 12% een%waarde ABC
Define an escape character if you want to match a wildcard character ('%' or '?'):
select kolom from tabel where kolom like '%\%%' escape '\' ;
kolom ------ een_waarde 12% een%waarde ABC
Soundex
The Soundex function translates a string to a numer, representing the 'sound' of the text. You can compare texts that do not _exactly_ match:
select * from tabel where soundex(kolom1) = soundex(kolom2) ;
Package UTL_MATCH
Oracle 10g contains the package UTL_MATCH with a number of functies to find matches. The following paragraphs will show some possibilities by example:
Used functions:
- TRIM
- remove leading and trailing spaces and tabs
- UPPER
- to upper case
- CONVERT (kolom1,'us7ascii')
- replace special characters , like in 'ëïñdhóvèn' -> 'eindhoven'
- UTL_MATCH.JARO_WINKLER_SIMILARITY(kolom1,kolom2)
- jaro-winkler matching %, ("Wikipedia":http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)
- UTL_MATCH.EDIT_DISTANCE_SIMILARITY(kolom1,kolom2)
- levenshtein-distance matching %, ("Wikipedia":http://en.wikipedia.org/wiki/Levenshtein_distance)
select kolom1, kolom2
, convert(Upper(Trim(kolom1)),'us7ascii')
, convert(Upper(Trim(kolom2)),'us7ascii')
, UTL_MATCH.JARO_WINKLER_SIMILARITY(convert(Upper(Trim(kolom1)),'us7ascii') , convert(Upper(Trim(kolom2)),'us7ascii')) as Winkler
, UTL_MATCH.EDIT_DISTANCE_SIMILARITY(convert(Upper(Trim(kolom1)),'us7ascii'),convert(Upper(Trim(kolom2)),'us7ascii')) as Distance
from (
select 'Eindhoven' as kolom1, 'Veldhoven' as kolom2 from dual
union
select 'Eindhoven', 'ëïñdhóvèn' from dual
union
select 'Eindhoven ', 'EiNdHovEn' from dual
union
select 'Eindhoven', 'Helmond' from dual
)
order by Winkler desc
;
Selecting the 'best' Winkler-match, with a minumum of 95% to limit the results:
select * from (
select kolom1, kolom2
, UTL_MATCH.JARO_WINKLER_SIMILARITY(
convert(Upper(Trim(kolom1)),'us7ascii')
, convert(Upper(Trim(kolom2)),'us7ascii')) as Winkler
from (
select 'Eindhoven' as kolom1, 'Veldhoven' as kolom2 from dual
union
select 'Eindhoven', 'ëïñdhóvèn' from dual
union
select 'Eindhoven ', 'EiNdHovEn' from dual
union
select 'Eindhoven', 'Helmond' from dual
)
order by Winkler desc
) where rownum<=1 -- TOP N rows
and Winkler>=95 -- minimum match %
;