5

I need Regexp, while M / Power Query doesn't have native support to it. I found several variants of solution around the same excellent Web.Page & JavaScript idea - Biccauntant, Hugoberry.

I had to adopt them (the resulting code is below) due to JavaScript limits.

The main problem is that JavaScript has it's own tuff limit on the String variables - it is not possible to use "\" in it.

My variant tackles the problem, but I'm not professional in JS, so the main question is - I suspect that the "\" problem is not the only one.

So my question is - does anybody see some other problems with PQ & JavaScript "co-operation" while using the Regexp? Regarding other pitfalls with strings, or whatever?

I understand that Regexp has it's own escaping rules (this is not only JavaScript demand, but Regexp itself), so escaping of Regex itself is out of the scope of the question. I.e. regular expression should be properly escaped BEFORE it may be passed as the function parameter. In other words, it is supposed, that if a user wants to use regular expression with "\", he must use "\\" instead, and my function will convert it to "\\\\", which will be passed to JavaScript as Regexp expression.

For PQ users - it turned out that the performance of the solution is quite good for hundreds of rows at least. But not forget that it is not possible to use the function in Power BI service, this is for PowerBI Desktop and Excel only.

The code:

(text as nullable text, pattern as nullable text) as logical =>
    let 
        l = List.Transform({text, pattern}, each Text.Replace(_, "\", "\\")), 
        t = Text.Format("<script>document.write(new RegExp('#{1}').test('#{0}'))</script>", l),
        w = Web.Page(t), 
        d = w[Data]?{0}?[Children]?{0}?[Children]?{1}?[Text]?{0}?, 
        result = text <> null and (pattern = null or (if d <> null then Logical.FromText(d) else error "Regular expression or text are not supported by JavaScript."))
    in
        result
Andrey Minakov
  • 545
  • 2
  • 5
  • 19

2 Answers2

3

I know javascript, but I don't know Powerbi/Powerquery so this answer may be incomplete.

There are other characters which need to be escaped in string literals.

For single-quoted string literals, \ and ' need to be escaped, as do some whitespace characters such as newline. There's a list of escapes at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String . Some of these are optional, but I'm not sure exactly which ones. I don't think you ever need any of the ones with hex digits. I said that ' needs to be escaped but not ", but if you were using a double-quoted string ("#{0}") it would be the other way round.

Apart from that, I can't tell you what escaping needs to be done for Powerquery. In fact I'm surprised that your string literal "\\" worked as intended.

In summary, it should look something like this (and feel free to correct my Powerquery syntax errors):

// Define function escapeString which inserts \ before each \ or ' and then applies the escape sequences \n, \r, \v, \t, \b and \f.
let 
    l = List.Transform({text, pattern}, each escapeString(_)), 
...
David Knipe
  • 3,417
  • 1
  • 19
  • 19
  • David, thanks a lot for your great answer, that's what I needed. Sorry for not mentioning that it is supposed that regular expression must conform Regex escaping rules BEFORE it is passed to the function (i.e. function works only with correct from any point of view Regexp expressions). I updated the question accordingly. – Andrey Minakov Aug 18 '19 at 21:06
  • No, I should have realised this. What I wrote was for using regex search to find a particular string. The argument was called "pattern", not "searchString" or similar, so it clearly has the potential to contain regex special characters which shouldn't be escaped. I've now deleted all the stuff about escaping for regexes. – David Knipe Aug 18 '19 at 21:30
2

Based on the code above (and correctly escaping single quotes '), I suggest the following functions (I include 3 functions, one for testing, one for matching and the other one for replacements).

NOTE: There is one issue that I am unable to fix. In the matching function the list returned will be incorrect if the string or matches contain content with a comma (','). I looked at using JSON.stringify() to output the match but it does not work, will open a question here and fix it later if possible.

Function to do regular expression testing

// regexTest
let   regexTest=(text as nullable text,replace as nullable text, optional flags as nullable text) as logical =>
    let
        f= if flags = null or flags ="" then "" else flags,
        l1 = List.Transform({text, replace}, each Text.Replace(_, "\", "\\")),
        l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
        l3 = List.Combine({l2,{f}}),
        t = Text.Format("<script>document.write(new RegExp('#{1}','#{2}').test('#{0}'))</script>", l3),
        r = Web.Page(t)[Data]{0}[Children]{0}[Children]{1}[Text]{0},
        Output = Logical.From(r)
    in Output
    
in regexTest

Function to do regular expression matching

// regexMatch
let   regexMatch=(text as nullable text, pattern as nullable text, optional flags as nullable text) as list =>
    let
        f=if flags = null or flags ="" then "" else flags,
        l1 = List.Transform({text, pattern}, each Text.Replace(_, "\", "\\")),
        l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
        t = Text.Format("<script>var txt='#{0}';document.write(txt.match(new RegExp('#{1}','#{2}')));</script>", List.Combine({l2,{f}})),
        r=Web.Page(t)[Data]{0}[Children]{0}[Children],
        Output=if List.Count(r)>1 then Text.SplitAny(r{1}[Text]{0},",") else {}
    in Output
in regexMatch

Function to do Regular expression replacing

// regexReplace
let   regexReplace=(text as nullable text,pattern as nullable text,replace as nullable text, optional flags as nullable text) as text =>
    let
        f=if flags = null or flags ="" then "" else flags,
        l1 = List.Transform({text, pattern, replace}, each Text.Replace(_, "\", "\\")),
        l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
        t = Text.Format("<script>var txt='#{0}';document.write(txt.replace(new RegExp('#{1}','#{3}'),'#{2}'));</script>", List.Combine({l2,{f}})),
        r=Web.Page(t)[Data]{0}[Children]{0}[Children],
        Output=if List.Count(r)>1 then r{1}[Text]{0} else ""
    in Output
in regexReplace
GMCB
  • 187
  • 2
  • 15
  • 1
    You nailed it, thank you very much. I'd just make two humble recommendations: 1) What you mentioned as "Function to do regular expression matching" I'd name the function as "RegexExtract", once it brings you the content of the matches. 2) What you mentioned as "Function to do regular expression testing" I'd name it as "RegexMatch", once it will return you true or false. (At least this is the way I've seen them to be called around there recently, ex: Google Sheets, Python, Perl..., etc, for the three main regex operations: match, extract matchings, replace matchings). – Gabz Mar 25 '22 at 20:54
  • @GMCB Just tested this out on a provided regex and it appears that look-backwards and look-forward respectively `?<!` isn't recognised. Would be great to know your thoughts on this. – Nick Jul 08 '22 at 08:33