REGEXP_REPLACE second argument must be const and non-null
Is there any
workaround for that?
Below is just an idea/direction to address above question applied to logic you described:
I would "cook" these two strings upfront into 'ga|an|na' and
'ga|an|no' (lists of 2-grams) and do this: REGEXP_REPLACE('ga|an|na',
'ga|an|no', ''). Then based on change in length I can calculate my
coeff.
The "workaround" is:
SELECT a.w AS w1, b.w AS w2, SUM(a.x = b.x) / COUNT(1) AS c
FROM (
SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
FROM
(SELECT 'gana' AS w, 'ga|an|na' AS p)
) AS a
JOIN (
SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
FROM
(SELECT 'gano' AS w, 'ga|an|no' AS p),
(SELECT 'gamo' AS w, 'ga|am|mo' AS p),
(SELECT 'kana' AS w, 'ka|an|na' AS p)
) AS b
ON a.pos = b.pos
GROUP BY w1, w2
Maybe there is a better way to do it?
Below is the simple example of how Pair Similarity can be approached here (including building bigrams sets and calculation of coefficient:
SELECT
a.word AS word1, b.word AS word2,
2 * SUM(a.bigram = b.bigram) /
(EXACT_COUNT_DISTINCT(a.bigram) + EXACT_COUNT_DISTINCT(b.bigram) ) AS c
FROM (
SELECT word, char + next_char AS bigram
FROM (
SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
FROM (
SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
FROM
(SELECT 'gana' AS word)
)
)
WHERE next_char IS NOT NULL
GROUP BY 1, 2
) a
CROSS JOIN (
SELECT word, char + next_char AS bigram
FROM (
SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
FROM (
SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
FROM
(SELECT 'gano' AS word)
)
)
WHERE next_char IS NOT NULL
GROUP BY 1, 2
) b
GROUP BY 1, 2