1

I need to use a formula in a Google Sheets cel that will extract the FIRST brazilian phone number from a data in another cel. The phone number will be in the middle of a html code such as:

  1. Example 1: <a href="tel:1123456789" class="btn-tel">(11) 2345-6789</a> >>> Would return the first number:: 1123456789
  2. Example 2: <strong>3003-1234</strong> >>> Would return: 3003-1234
  3. Example 3: <a _ngcontent-iqq-c82="" href="tel:30031234" class="ng-tns-c82-0">3003-1025 </a> >>> Would return: 30031234

There are 3 different types of numbers:

  • 0800abcdefg: where abcdefg could be numbers from 0 to 9 and this number could be presented as: 0800 abcdefg, 0800 abc defg or 0 800 abc defg.
  • XXXXabcd: where XXXX could be 3003, 3004, 4003, 4004 or 4020 and abcd could be numbers from 0 to 9. This type of number could be presented as: XXXX-abcd, XXXXabcd, XXXX abcd or XXXX.abcd
  • XXabcdefgh: where XX is a number that never has 0 (no 10,03,30...) and cannot be one of the following numbers: 23,25,26,29,36,39,52,56,57,58,59,72,76 or 78. The a number can be only 2-5 and b, c, d, e, f, g and h can be a number from 0 to 9. This number can be presented as: (XX)abcdefgh, (XX)abcd.efgh, (XX) abcd-efgh, XXabcdefgh, XX abcd efgh, XX.abcd.efgh, XX abcd-efgh

I tried using the following Regex with the formula REGEXTRACT in Google Sheets but it returns error saying that it is not a valid regex:

(?!23|25|26|29|36|39|52|56|57|58|59|72|76|78)(?:(((?)([1-9]{2})()?)( ?)[2-5])[0-9]{3}|3003|3004|4003|4004|4020|0( ?)800)(-?|.?| ?[0-9]{3})(-?|.?| ?)[0-9]{4}

Although it may not be an optimal regex, I have tried it on https://regex101.com/ and it worked just fine. BUt maybe it uses tokens not allowed in Google Sheets.

If that is not possible, the regex can consider that the numbers will be presented only as numbers, such as: 0800abcdefg,XXXXabcd or XXabcdefgh.

Aps
  • 13
  • 3
  • 1
    `(?)` is not a valid token in ECMAScript flavor, which Google sheets use. – InSync Mar 21 '23 at 11:35
  • Neither is a negative lookahead which btw could be rewritten like `(?!2[3569]|3[69]|5[26-9]|7[268])` to condense the options a bit. Do you have some sample data with expected results to work with? – JvdV Mar 21 '23 at 11:43
  • @JvdV ere are some sample data with expected results: Example 1: (11) 2345-6789 >>> Would return both: 1123456789 and (11) 2345-6789 Example 2: 3003-1234 >>> Would return: 3003-1234 Example 3: 3003-1025 >>> Would return: 30031234 – Aps Mar 21 '23 at 12:09
  • Hi @Destroy666, I need to use a formula in a Google Sheets cel that will extract a brazilian phone number from a data in another cel. The phone number will be in the middle of html code such as the examples before. There are 3 different types of numbers 0800abcdefg, XXXXabcd (where XXXX could be 3003, 3004, 4003, 4004 or 4020) and XYabcdmcdu (where a and b are between 1 and 9 and cannot be 23, 24, 25, ...). The numbers can be just numbers or formated for example: 3003-1234, (11) 2345.6789, 0800 123 4567, ... I tried your regex in google sheets but got the same error of invalid regex. – Aps Mar 21 '23 at 15:52
  • Ok, so it's not ECMAScript like someone mentioned, it actuall uses `RE2` engine from what I see. Let me see if I can quickl craft something. Anyways, please edit the question with additional info like that (examples of input/output, better description). – Destroy666 Mar 21 '23 at 16:06
  • Quick look and I don't think what you request with multiple matches is possible/viable as `REGEXEXTRACT` just returns all groups of the first match or the entire match if there are no groups. So you would need to know how many phone numbers there are exactly or at most. There are way better tools than Google Sheets for this task. – Destroy666 Mar 21 '23 at 16:26
  • @Destroy666 would it work if the formula returns just the first number it finds or if we assume that the numbers are always presented as just numbers (no () - . ..)? – Aps Mar 21 '23 at 16:51
  • Yes, if you want and edit your question to match that requirement then I can post a solution for that. But the negative lookahead part will need to also be skipped/reworked, not sure what the point of that was, still. As you wrote that `ab` can't be `23` etc. but it looks in your initial regex as if it was for `XY`. – Destroy666 Mar 21 '23 at 16:54
  • Done @Destroy666. In my initial regex the first part was to not consider the numbers that don't exist, so it was excluding numbers that would have 23, 25, 26, 29, 36, ... – Aps Mar 21 '23 at 17:09

2 Answers2

0

This convoluted regex works:

0\s*800\d{3}\s*\d{4}|(?:300[34]|400[34]|4020)[\s\.-]?\d{4}|\(?(?:[14689]\d|2[^356\D]|3[^69\D]|5[^267-9\D]|7[^268\D])\)?\s*\d{4}[\s\.-]?\d{4}

We split it into 3 alternatives separated by | for your 3 requirements:

  • 0\s*800\d{3}\s*\d{4} - this matches 0800 or 0 800 followed by 3 and 4 digits possibly split by whitespaces
  • (?:300[34]|400[34]|4020)[\s\.-]?\d{4} - matches 3003, 3004, 4003, 4004 or 4020 and then 4 digits separated by one whitespace, . or -. We use non-capturing group for alternatives so that REGEXEXTRACT considers the whole match, else it would consider all the groups of 1st match.
  • \(?(?:[14689]\d|2[^356\D]|3[^69\D]|5[^267-9\D]|7[^268\D])\)?[\s\.]?\d{4}[\s\.-]?\d{4} - is the messy part that first all the possible two-digit codes optionally wrapped by (). Instead of using impossible negative lookahead, we use negations of digits that we don't want as combos and also include \D non-digits in the sets. Then it's similar to the one above, but instead matches 4 digits rather than specific numbers for the first 4 characters.

Demo

Alternatives 2 and 3 could perhaps be combined to shorten the regex, but I'm not bothering since it's just for sheets anyways. And this long syntax makes sure all the possibilities are considered how they need to be.

EDIT: and to match multiple with the regex it turns out you can apparently try creating your own function like this one.

Destroy666
  • 892
  • 12
  • 19
0

You may try:

=let(Σ,join("|","0800\d{7}","3003\d{4}","3004\d{4}","4003\d{4}","4004\d{4}","4020\d{4}",
            join("[2-5]\d{7}|",filter(sequence(88,1,11,1),iserror(xmatch(sequence(88,1,11,1),{23,25,26,29,36,39,52,56,57,58,59,72,76,78,20,30,40,50,60,70,80,90}))))&"[2-5]\d{7}"),
     index(if(len(A:A),ifna(regexextract(regexreplace(regexreplace(A:A&"","[^(\d\s-\.)]",""),"[^\d]",""),Σ),"-"),)))
  • Please do test it with bit more data samples to make sure there's not gonna be any false positives

enter image description here

rockinfreakshow
  • 15,077
  • 3
  • 12
  • 19