1

I'm trying to translate a query to run in BigQuery that uses the OTRANSLATE function from Teradata. For example,

SELECT OTRANSLATE(text, 'ehlo', 'EHLO')
FROM (
  SELECT 'hello world' AS text UNION ALL
  SELECT 'elliott'
);

This should produce:

HELLO wOrLd
ELLiOtt

Is there any way of expressing this function in BigQuery? It doesn't look like there is a direct equivalent.

Elliott Brossard
  • 32,095
  • 2
  • 67
  • 99

2 Answers2

2

Another, slightly different approach (BigQuery Standard SQL)

#standardSQL
CREATE TEMP FUNCTION OTRANSLATE(text STRING, from_string STRING, to_string STRING) AS ((
  SELECT STRING_AGG(IFNULL(y, a), '' ORDER BY pos)
  FROM UNNEST(SPLIT(text, '')) a WITH OFFSET pos
  LEFT JOIN (
    SELECT x, y
    FROM UNNEST(SPLIT(from_string, '')) x WITH OFFSET
    JOIN UNNEST(SPLIT(to_string, '')) y WITH OFFSET
    USING(OFFSET)
  )
  ON a = x
));
WITH `project.dataset.table` AS (
  SELECT 'hello world' AS text UNION ALL
  SELECT 'elliott'
)
SELECT text, OTRANSLATE(text, 'ehlo', 'EHLO') as new_text
FROM `project.dataset.table`   

with output

Row     text            new_text     
1       hello world     HELLO wOrLd  
2       elliott         ELLiOtt   

Note: above version assumes from and to strings of equal length and no repeated chars in from string

Update to follow up on "expanded expectations" for the version of that function in BigQuery

#standardSQL
CREATE TEMP FUNCTION OTRANSLATE(text STRING, from_string STRING, to_string STRING) AS ((
  SELECT STRING_AGG(IFNULL(y, a), '' ORDER BY pos)
  FROM UNNEST(SPLIT(text, '')) a WITH OFFSET pos
  LEFT JOIN (
    SELECT x, ARRAY_AGG(IFNULL(y, '') ORDER BY OFFSET LIMIT 1)[OFFSET(0)] y
    FROM UNNEST(SPLIT(from_string, '')) x WITH OFFSET
    LEFT JOIN UNNEST(SPLIT(to_string, '')) y WITH OFFSET
    USING(OFFSET)
    GROUP BY x
  )
  ON a = x
));
SELECT -- text, OTRANSLATE(text, 'ehlo', 'EHLO') as new_text
  OTRANSLATE("hello world", "", "EHLO") AS empty_from, -- 'hello world'
  OTRANSLATE("hello world", "hello world1", "EHLO") AS larger_from_than_source, -- 'EHLLL'
  OTRANSLATE("hello world", "ehlo", "EHLO") AS equal_size_from_to, -- 'HELLO wOrLd'
  OTRANSLATE("hello world", "ehlo", "EH") AS larger_size_from, -- 'HE wrd'
  OTRANSLATE("hello world", "ehlo", "EHLOPQ") AS larger_size_to, -- 'hello world'
  OTRANSLATE("hello world", "ehlo", "") AS empty_to; -- 'wrd'

with result

Row empty_from  larger_from_than_source equal_size_from_to  larger_size_from    larger_size_to  empty_to     
1   hello world EHLLL                   HELLO wOrLd             HE wrd          HELLO wOrLd     wrd    
.   

Note: Teradata version of this function is recursive, so the current implementation is not exact implementation of Teradata's OTRANSLATE

Usage Notes (from teradata documentation)
If the first character in from_string occurs in the source_string, all occurrences of it are replaced by the first character in to_string. This repeats for all characters in from_string and for all characters in from_string. The replacement is performed character-by-character, that is, the replacement of the second character is done on the string resulting from the replacement of the first character.

This can be easily implemented with JS UDF which is trivial, I think so I am not going this direction :o)

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
1

Yes, you can do this using array operations over the strings. Here is one solution:

CREATE TEMP FUNCTION OTRANSLATE(s STRING, key STRING, value STRING) AS (
  (SELECT
     STRING_AGG(
       IFNULL(
         (SELECT value[OFFSET(
            SELECT o FROM UNNEST(SPLIT(key, '')) AS k WITH OFFSET o2
            WHERE k = c)]
         ),
         c),
       '' ORDER BY o1)
   FROM UNNEST(SPLIT(s, '')) AS c WITH OFFSET o1)
  )
);

SELECT OTRANSLATE(text, 'ehlo', 'EHLO')
FROM (
  SELECT 'hello world' AS text UNION ALL
  SELECT 'elliott'
);

The idea is to find the character in the same position of the key string in the value string. If there is no matching character in the key string, we end up with a null offset, so the second argument to IFNULL causes it to return the unmapped character. We then aggregate back into a string, ordered by character offset.

Edit: Here's a variant that also handles differences in the key and value lengths:

CREATE TEMP FUNCTION otranslate(s STRING, key STRING, value STRING) AS (
  IF(LENGTH(key) < LENGTH(value) OR LENGTH(s) < LENGTH(key), s,
  (SELECT
     STRING_AGG(
       IFNULL(
         (SELECT ARRAY_CONCAT([c], SPLIT(value, ''))[SAFE_OFFSET((
            SELECT IFNULL(MIN(o2) + 1, 0) FROM UNNEST(SPLIT(key, '')) AS k WITH OFFSET o2
            WHERE k = c))]
         ),
         ''),
       '' ORDER BY o1)
   FROM UNNEST(SPLIT(s, '')) AS c WITH OFFSET o1
  ))
);
SELECT
  otranslate("hello world", "", "EHLO") AS empty_from, -- 'hello world'
  otranslate("hello world", "hello world1", "EHLO") AS larger_from_than_source, -- 'hello world'
  otranslate("hello world", "ehlo", "EHLO") AS equal_size_from_to, -- 'HELLO wOrLd'
  otranslate("hello world", "ehlo", "EH") AS larger_size_from, -- 'HE wrd'
  otranslate("hello world", "ehlo", "EHLOPQ") AS larger_size_to, -- 'hello world'
  otranslate("hello world", "ehlo", "") AS empty_to; -- 'wrd'
Elliott Brossard
  • 32,095
  • 2
  • 67
  • 99