Oracle Fuzzy Text Matching

From WickyWiki
Revision as of 10:03, 28 July 2018 by Wilbert (talk | contribs)

Oracle Text

To be able to "Googele" your database Oracle developed the "Oracle Text" product (Oracle 10g). Configuration however is complicated, with more functionalities then you will ever need. That's why some quick and dirty alternatives on this page.

More info:

Like

With _LIKE_ you can use wildcards to match text fields. '?' is used to match any single character and '%' is used to match none or more arbitrary characters:

select kolom 
  from tabel 
  where kolom like 'een?waarde%'
  ;
kolom
------
een waarde
een_waarde 123
een_waarde 12%
een%waarde ABC

Define an escape character if you want to match a wildcard character ('%' or '?'):

select kolom
  from tabel 
  where kolom like '%\%%' escape '\'
  ;
kolom
------
een_waarde 12%
een%waarde ABC

Soundex

The Soundex function translates a string to a numer, representing the 'sound' of the text. You can compare texts that do not _exactly_ match:

select *
  from tabel 
  where soundex(kolom1) =  soundex(kolom2)
  ;

Package UTL_MATCH

Oracle 10g contains the package UTL_MATCH with a number of functies to find matches. The following paragraphs will show some possibilities by example:

Used functions:

select kolom1, kolom2
  , convert(Upper(Trim(kolom1)),'us7ascii')
  , convert(Upper(Trim(kolom2)),'us7ascii')
  , UTL_MATCH.JARO_WINKLER_SIMILARITY(convert(Upper(Trim(kolom1)),'us7ascii') , convert(Upper(Trim(kolom2)),'us7ascii')) as Winkler
  , UTL_MATCH.EDIT_DISTANCE_SIMILARITY(convert(Upper(Trim(kolom1)),'us7ascii'),convert(Upper(Trim(kolom2)),'us7ascii')) as Distance
  from ( 
    select 'Eindhoven' as kolom1, 'Veldhoven' as kolom2 from dual 
    union
    select 'Eindhoven', 'ëïñdhóvèn' from dual 
    union
    select 'Eindhoven ', 'EiNdHovEn' from dual 
    union
    select 'Eindhoven', 'Helmond' from dual 
  )
  order by Winkler desc
  ;

Selecting the 'best' Winkler-match, with a minumum of 95% to limit the results:

select * from (
  select kolom1, kolom2
    , UTL_MATCH.JARO_WINKLER_SIMILARITY(
          convert(Upper(Trim(kolom1)),'us7ascii')
        , convert(Upper(Trim(kolom2)),'us7ascii')) as Winkler
    from (
      select 'Eindhoven' as kolom1, 'Veldhoven' as kolom2 from dual
      union
      select 'Eindhoven', 'ëïñdhóvèn' from dual
      union
      select 'Eindhoven ', 'EiNdHovEn' from dual
      union
      select 'Eindhoven', 'Helmond' from dual
    )
    order by Winkler desc
  ) where rownum<=1 -- TOP N rows
    and Winkler>=95 -- minimum match %
;