1

I'm trying to build a regex to find a word inside a table in bigquery that can contain accents. The input could have accents or not. I'm coding it in scala, so the input can be transformed in scala or using functions of bigquery. I guess a good solution could be something like that:

input: serie
word to find: Séries

I can transform the input in

r'(?i:s[éèe]r[íìi][éèe]s)'

but I think it would be better to transform the column of the table in something like:

r'(?i:s[ée]ries)'

I don't know how to solve the second situation or if a better way to solve it. Thanks in advance

  • 1
    check http://stackoverflow.com/a/43148949/5221944 - you can adopt approach for there - if this will not help - please clarify your use-case - it is not clear enough I think – Mikhail Berlyant Apr 04 '17 at 18:55

2 Answers2

1

I'm trying to build a regex to find a word inside a table in bigquery that can contain accents.

Play with example below

so the input can be transformed in scala or using functions of bigquery

It does the whole stuff in BigQuery Standard SQL

#standardSQL
CREATE TEMP FUNCTION latin2accents(word STRING) AS
((
  WITH lookups AS (
    SELECT 
    'y,a,e,i,o,u,c,ç,n,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,Ø,Å,Á,À,Â,Ä,È,É,Ê,Ë,Í,Î,Ï,Ì,Ò,Ó,Ô,Ö,Ú,Ù,Û,Ü,Ÿ,Ç,Æ,Œ,ñ' AS accents,
    'y,a,e,i,o,u,c,c,n,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,O,A,A,A,A,A,E,E,E,E,I,I,I,I,O,O,O,O,U,U,U,U,Y,C,AE,OE,n' AS latins
  ), pairs AS (
    SELECT accent, latin FROM lookups, 
      UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1, 
      UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
    WHERE p1 = p2
  ), map AS (
    SELECT latin, CONCAT('[', STRING_AGG(accent, ''), ']') AS accents
    FROM pairs  
    GROUP BY latin
  )
  SELECT CONCAT('(?i:', STRING_AGG(IFNULL(accents, char), ''), ')')
  FROM UNNEST(SPLIT(word, '')) char
  LEFT JOIN map
  ON char = latin
));

WITH yourTable AS (
  SELECT 'Séries' AS word UNION ALL SELECT 'Series' UNION ALL
  SELECT 'brasília' UNION ALL SELECT 'Niño' UNION ALL SELECT 'aperçu' 
), inputs AS (
  SELECT 'series' AS input UNION ALL SELECT 'Brasilia' UNION ALL
  SELECT 'nino' UNION ALL SELECT 'apercu'
)
SELECT input, word AS found_word 
FROM yourTable CROSS JOIN inputs
WHERE REGEXP_CONTAINS(word,  latin2accents(input)) = TRUE
ORDER BY input, word

The output is (which I think is what you wanted)

input       found_word   
-----       ----------
Brasilia    brasília     
apercu      aperçu   
nino        Niño     
series      Series   
series      Séries   
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
1

Using the Java Collator (see for example this answer), you could take a perhaps simpler approach:

scala> val c = java.text.Collator.getInstance()
c: java.text.Collator = java.text.RuleBasedCollator@289747d6
scala> c.setStrength(Collator.PRIMARY)
scala> c.equals("é","E")
res24: Boolean = true
scala> c.setStrength(Collator.SECONDARY)
scala> c.equals("é","E")
res26: Boolean = false

This takes into account your locale, as the rules for what constitutes a 'base' (equivalent) character are going to be different per language.

wwkudu
  • 2,778
  • 3
  • 28
  • 41