BigQuery: Convert accented characters to their plain ascii equivalents

Question

I have the following string:

brasília

And I need to convert to:

brasilia

Withou the ´ accent!

How can I do on BigQuery?

Thank you!

Just announced for StandardSQL: [normalize](http://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#normalize) — syber beynon, Sep 29 '17 at 11:56

Mikhail Berlyant · Accepted Answer · 2017-03-31T20:33:04.933

Try below as quick and simple option for you:

#standardSQL
WITH lookups AS (
  SELECT 
  'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,Ø,Å,Á,À,Â,Ä,È,É,Ê,Ë,Í,Î,Ï,Ì,Ò,Ó,Ô,Ö,Ú,Ù,Û,Ü,Ÿ,Ç,Æ,Œ,ñ' AS accents,
  'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,O,A,A,A,A,A,E,E,E,E,I,I,I,I,O,O,O,O,U,U,U,U,Y,C,AE,OE,n' AS latins
),
pairs AS (
  SELECT accent, latin FROM lookups, 
    UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1, 
    UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
  WHERE p1 = p2
),
yourTableWithWords AS (
  SELECT word FROM UNNEST(
        SPLIT('brasília,ångström,aperçu,barège, beau idéal, belle époque, béguin, bête noire, bêtise, Bichon Frisé, blasé, blessèd, bobèche, boîte, bombé, Bön, Boötes, boutonnière, bric-à-brac, Brontë Beyoncé,El Niño')
    ) AS word
)
SELECT 
  word AS word_with_accent, 
  (SELECT STRING_AGG(IFNULL(latin, char), '')
    FROM UNNEST(SPLIT(word, '')) char
    LEFT JOIN pairs
    ON char = accent) AS word_without_accent
FROM yourTableWithWords

Output is

word_with_accent word_without_accent     
blessèd         blessed  
El Niño         El Nino  
belle époque    belle epoque     
boîte           boite    
Boötes          Bootes   
blasé           blase    
ångström        angstrom     
bobèche         bobeche  
barège          barege   
bric-à-brac     bric-a-brac  
bête noire      bete noire   
Bichon Frisé    Bichon Frise     
Brontë Beyoncé  Bronte Beyonce   
bêtise          betise   
beau idéal      beau ideal   
bombé           bombe    
brasília        brasilia     
boutonnière     boutonniere  
aperçu          apercu   
béguin          beguin   
Bön             Bon

UPDATE

Below is how to pack this logic into SQL UDF - so accent2latin(word) can be called to make a "magic"

#standardSQL
CREATE TEMP FUNCTION accent2latin(word STRING) AS
((
  WITH lookups AS (
    SELECT 
    'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,Ø,Å,Á,À,Â,Ä,È,É,Ê,Ë,Í,Î,Ï,Ì,Ò,Ó,Ô,Ö,Ú,Ù,Û,Ü,Ÿ,Ç,Æ,Œ,ñ' AS accents,
    'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,O,A,A,A,A,A,E,E,E,E,I,I,I,I,O,O,O,O,U,U,U,U,Y,C,AE,OE,n' AS latins
  ),
  pairs AS (
    SELECT accent, latin FROM lookups, 
      UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1, 
      UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
    WHERE p1 = p2
  )
  SELECT STRING_AGG(IFNULL(latin, char), '')
  FROM UNNEST(SPLIT(word, '')) char
  LEFT JOIN pairs
  ON char = accent
));

WITH yourTableWithWords AS (
  SELECT word FROM UNNEST(
        SPLIT('brasília,ångström,aperçu,barège, beau idéal, belle époque, béguin, bête noire, bêtise, Bichon Frisé, blasé, blessèd, bobèche, boîte, bombé, Bön, Boötes, boutonnière, bric-à-brac, Brontë Beyoncé,El Niño')
    ) AS word
)
SELECT 
  word AS word_with_accent, 
  accent2latin(word) AS word_without_accent
FROM yourTableWithWords

Thank you a lot, @Mikhail. It's an awesome solution! I'm already using it with UDF like you defined. — Felipe Carlo, Mar 31 '17 at 20:41

sigpwned · Answer 2 · 2023-08-09T22:52:55.710

It's worth mentioning that what you're asking for is a simplified case of unicode text normalization. Many languages have a function for this in their standard libraries (e.g., Java). One good approach would be to insert your text into BigQuery already normalized. If that won't work -- for example, because you need to retain the original text and you're concerned about hitting BigQuery's row size limit -- then you'll need to do normalization on the fly in your queries.

Some databases have implementations of Unicode normalization of various completeness (e.g., PostgreSQL's unaccent method, PrestoDB's normalize method) for use in queries. Unfortunately, BigQuery is not one of them. There is no text normalization function in BigQuery as of this writing. (EDIT: This has changed! See below.) The implementations on this answer are kind of a "roll your own unaccent." When BigQuery releases an official function, everyone should use that instead!

Assuming you need to do the normalization in your query (and Google still hasn't come out with a function for this yet), these are some reasonable options.

Approach 1: Use `NORMALIZE`

Google now has come out with a NORMALIZE function. (Thanks to @WillianFuks in the comments for flagging!) This is now the obvious correct choice for text normalization. For example:

SELECT REGEXP_REPLACE(NORMALIZE(text, NFD), r"\pM", '') FROM yourtable;

There is a brief explanation of how this works and why the call to REGEXP_REPLACE is needed in the comments.

I have left the additional approaches for reference.

Approach 2: Use `REGEXP_REPLACE` and `REPLACE` on Content

I implemented the lowercase-only case of text normalization in legacy SQL using REGEXP_REPLACE. (The analog in Standard SQL is fairly self-evident.) I ran some tests on a text field with average length around 1K in a large table of 28M rows using the query below:

SELECT id, text FROM
  (SELECT 
    id,
    CASE
    WHEN REGEXP_CONTAINS(LOWER(text), r"[àáâäåæçèéêëìíîïòóôöøùúûüÿœ]") THEN
      REGEXP_REPLACE(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              REGEXP_REPLACE(
                REPLACE(REPLACE(REPLACE(REPLACE(LOWER(text), 'œ', 'ce'), 'ÿ', 'y'), 'ç', 'c'), 'æ', 'ae'),
              r"[ùúûü]", 'u'),
            r"[òóôöø]", 'o'),
          r"[ìíîï]", 'i'),
        r"[èéêë]", 'e'),
      r"[àáâäå]", 'a')
    ELSE
      LOWER(text)
    END AS text
  FROM
   yourtable ORDER BY id LIMIT 10);

versus:

WITH lookups AS (
  SELECT 
  'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,ñ' AS accents,
  'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,n' AS latins
),
pairs AS (
  SELECT accent, latin FROM lookups, 
    UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1, 
    UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
  WHERE p1 = p2
)
SELECT foo FROM (
  SELECT 
    id,
    (SELECT STRING_AGG(IFNULL(latin, char), '') AS foo FROM UNNEST(SPLIT(LOWER(text), '')) char LEFT JOIN pairs ON char=accent) AS foo
  FROM
  yourtable ORDER BY id LIMIT 10);

On average, the REGEXP_REPLACE implementation ran in about 2.9s; the array-based implementation ran in about 12.5s.

Approach 3: Use `REGEXP_REPLACE` on Search Pattern

What brought me to this question my was a search use case. For this use case, I can either normalize my corpus text so that it looks more like my query, or I can "denormalize" my query so that it looks more like my text. The above describes an implementation of the first approach. This describes an implementation of the second.

When searching for a single word, one can use the REGEXP_MATCH match function and merely update the query using the following patterns:

a -> [aàáaâäãåā]
e -> [eèéêëēėę]
i -> [iîïíīįì]
o -> [oôöòóøōõ]
u -> [uûüùúū]
y -> [yÿ]
s -> [sßśš]
l -> [lł]
z -> [zžźż]
c -> [cçćč]
n -> [nñń]
æ -> (?:æ|ae)
œ -> (?:œ|ce)

So the query "hello" would look like this, as a regexp:

r"h[eèéêëēėę][lł][lł][oôöòóøōõ]"

Transforming the word into this regular expression should be fairly straightforward in any language. This isn't a solution to the posted question -- "How do I remove accents in BigQuery?" -- but is rather a solution to a related use case, which might have brought people (like me!) to this page.

I wonder how the normalization approach would work to solve this issue; I tried using the new [normalize](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#normalize) function but it didn't work. Do you know how to make it? — Willian Fuks, Dec 18 '17 at 16:59
When you used the new `NORMALIZE` function, did you simply call `NORMALIZE`? By default, that function will decompose the "alphabetical" character from the "diacritic" characters to create a standardized sequence representation of accented characters, but it will not remove the diacritics. If you want to remove the diacritics, you need to perform another step; something like `REGEXP_REPLACE(s, r"[^\pL\pN\pP\pS\pZ]", '')` or `REGEXP_REPLACE(s, r"[\pM]", '')` will probably do the trick. (See [here](https://github.com/google/re2/wiki/Syntax) for the unicode character classes used here.) — sigpwned, Dec 20 '17 at 02:16
I miss NFD or NFKD in yours NORMALIZE like: NORMALIZE(nome, NFD). See [this post](https://stackoverflow.com/a/48509105/2048848) for more details. — Murta, Jul 07 '20 at 19:48

Murta · Answer 3 · 2020-12-09T02:28:20.617

I like this answer explanation. You can use:

REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '')

As a simple example:

WITH data AS(
  SELECT 'brasília / paçoca' AS text
)

SELECT
  REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '') RemovedDiacritics
FROM data

brasilia / pacoca

UPDATE

With the new string function Translate, it's much simpler to do it:

WITH data AS(
  SELECT 'brasília / paçoca' AS text
)

SELECT
  translate(text, "ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöùúûüýÿ", "SZszYAAAAAACEEEEIIIIDNOOOOOUUUUYaaaaaaceeeeiiiidnooooouuuuyy") as RemovedDiacritics
FROM data

brasilia / pacoca

`translate` should now become the accepted answer. – Valentas Apr 15 '21 at 07:24 — Valentas, Apr 15 '21 at 07:24

score 1 · Answer 4 · edited May 23 '17 at 12:02

1

You can call REPLACE() or REGEXP_REPLACE(). You can find some regular expressions at Remove accents/diacritics in a string in JavaScript.

Alternatively, you can use javascript UDF, but I expect it to be slower.

edited May 23 '17 at 12:02

Community

1
1

answered Mar 31 '17 at 19:40

Mingyu Zhong

471
4
3

score 0 · Answer 5 · answered Dec 06 '22 at 23:05

0

Just call --> select bigfunctions.eu.remove_accents('Voilà !') as cleaned_string

(BigFunctions are open-source BigQuery functions callable by anyone from their BigQuery Project)

https://unytics.io/bigfunctions/reference/#remove_accents

answered Dec 06 '22 at 23:05

Paul Marcombes

137
4

BigQuery: Convert accented characters to their plain ascii equivalents

5 Answers5

Approach 1: Use `NORMALIZE`

Approach 2: Use `REGEXP_REPLACE` and `REPLACE` on Content

Approach 3: Use `REGEXP_REPLACE` on Search Pattern

Linked

BigQuery: Convert accented characters to their plain ascii equivalents

5 Answers5

Approach 1: Use NORMALIZE

Approach 2: Use REGEXP_REPLACE and REPLACE on Content

Approach 3: Use REGEXP_REPLACE on Search Pattern

Linked

Approach 1: Use `NORMALIZE`

Approach 2: Use `REGEXP_REPLACE` and `REPLACE` on Content

Approach 3: Use `REGEXP_REPLACE` on Search Pattern