2

I'm currently working with a database in Googlesheets containing app publishers linked to financial apps. I plan to match it with another list that has multiple bank names from multiple countries. The problem is that the first database has the names of the publishers as the original names, but second list has some bank names translated to English and some of them have the original native name. Specifically, the names of banks that appear to be written originally with a non-Latin script (such as Korean, Cyrillic, Japanese or Arabic) have been translated to English, and the names of banks that are written in a Latin-script-based language (as Spanish, Romanian, Slovenian or French) all appear to be not translated, just without any diacritic.

Because of this, i'm trying to use Regex in Googlesheets in order to check the cells for non-Latin Unicode letter characters (and that's why this question doesn't help, nor this one). Since currently Googlesheets' REGEXMATCH is incompatible with Unicode class characters, i've been forced to use the QUERY function (following this answer). Let's say that I have this column A:

A
融360
הבנק הבינלאומי
АТ "Акцент-Банк
АО АЛСЕКО
АО "Altyn Bank"
АКБ "Kapitalbank
Şekerbank
İninal
Československá obchodní banka
iBillionaire
iBear
iBankマーケティング株式会社
4finance
11번가(주)
11com7 design
1 2 3 Apps
1 2 3 Apps
(주)인포바인

I want to use QUERY in another column combined with WHERE MATCHES in order to be able to use the Latin Unicode class, and I want the QUERY function to give results only when non-Latin letter characters appear. That is, I want something like this as a result in column B:

A:B
融360:融360
הבנק הבינלאומי:הבנק הבינלאומי
АТ "Акцент-Банк:АТ "Акцент-Банк
АО АЛСЕКО:АО АЛСЕКО
АО "Altyn Bank":АО "Altyn Bank" \\ These are Cyrillic A and O
АКБ "Kapitalbank:АКБ "Kapitalbank
Şekerbank:#N/A
İninal:#N/A
Československá obchodní banka:#N/A
iBillionaire:#N/A
iBear:#N/A
iBankマーケティング株式会社:iBankマーケティング株式会社
4finance:#N/A
11번가(주):11번가(주)
11com7 design:#N/A
1 2 3 Apps:#N/A
1 2 3 Apps:#N/A
(주)인포바인:(주)인포바인

I'm doing this with this formula: =QUERY(A;"select A where A matches 'SomeREGEX'";0), but I don't seem to get the correct regular expression. After a lot of unsuccessful attempts, I tried [\p{Latin}\d\s]*[^\p{Latin}]+[\p{Latin}\d\s]* that gives a correct answer to АО АЛСЕКО, Československá obchodní banka, 11번가(주), 11com7 design and (주)인포바인, but not with АО "Altyn Bank", АКБ "Kapitalbank or iBankマーケティング株式会社

What I might be doing wrong?

1 Answers1

0

You may use

=REGEXMATCH(A1,".*([\x{0080}-\x{02AF}]|\d.*[a-zA-Z]|[a-zA-Z].*\d).*|^[a-zA-Z]+$")

See the regex demo

Details

  • .* - any 0 or more chars other than line break chars, as many as possible
  • ([\x{0080}-\x{02AF}]|\d.*[a-zA-Z]|[a-zA-Z].*\d) -
    • [\x{0080}-\x{02AF}] - a character from U+0080 to U+02AF range
    • | - or
    • \d.*[a-zA-Z]|[a-zA-Z].*\d - a digit, any 0+ chars as many as possible, ASCII letter, or an ASCII letter, then 0 or more chars as a digit after
  • .* - any 0 or more chars other than line break chars, as many as possible
    • | - or
  • ^ - start of string
  • [a-zA-Z]+ - 1 or more ASCII letters
  • $ - end of string
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563