3

I am attempting to split an example paragraph into sentences using regex in Power Query:

Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Dr. Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.However, this line wont do it. Qr. Test for Website.COM and Labs.ORG looks good.Creatively not working. t and finished. 9 to start

Into:

Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.

Did he mind? Dr. Adam Jones Jr. thinks he didn't.

In any case, this isn't true...

Well, with a probability of .9 it isn't.

However, this line wont do it.

Qr.

Test for Website.

COM and Labs.

ORG looks good.

Creatively not working. t and finished.

9 to start

Here is a function that enables PQ to utilise regex replace:

FnRegexReplace

// regexReplace
    let   regexReplace=(text as nullable text,pattern as nullable text,replace as nullable text, optional flags as nullable text) as text =>
        let
            f=if flags = null or flags ="" then "" else flags,
            l1 = List.Transform({text, pattern, replace}, each Text.Replace(_, "\", "\\")),
            l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
            t = Text.Format("<script>var txt='#{0}';document.write(txt.replace(new RegExp('#{1}','#{3}'),'#{2}'));</script>", List.Combine({l2,{f}})),
            r=Web.Page(t)[Data]{0}[Children]{0}[Children],
            Output=if List.Count(r)>1 then r{1}[Text]{0} else ""
        in Output
    in regexReplace

I also have the following regex provided prom a previous post which appears to work on Regex101.

https://regex101.com/r/WEC0M9/6

Pattern: (?<!Mr|Mrs|Dr|Jr)(\.+)(\s+(?![a-z])|(?=[A-Z]))

Replace: $1\r\n (I think this can be anything like *)

flags: gm

The issue I have is that when I attempt this is Power Query I am returned with no result:

enter image description here

Alternatively (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s can be found here but the same issue occurs.

The issue appears to lie with the look-backwards and look-forward respectively ? as the function at least returns a result when this is removed. If anyone can advice on how to best get this paragraph to split using regex as shwon above in PQ that would be great.

M Code: 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("TY9BSwNBDIX/yqPnJVRBiicRetBCD1LBQ+1hdid1Qmcmy0zWpf/e2aLgMXnv5X05Hlf7QnDZY18q4ZDEAnqdvoJhCOzGKsY0aMJZC+7oAUliFM3wGqMrtYMQEwJjdOLhENVuXjHCtm2akiT7J2xb0bN3CTvNXLFrowXJl7pYvPj8Oa3X95sWe82N6IrBVe4WT4XUPxVWJiYifHCMHeaF12Es2rteotgVegY9tvp/IXrRmb+5/F6LkhmzZmtP3DjfGss7V1udTj8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "FnRegexReplace", each FnRegexReplace([Column1], "(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s", "$1\r\n", "gm"))
in
    #"Invoked Custom Function"

Update1: M Code with proposed Regex:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("TY9BSwNBDIX/yqPnJVRBiicRetBCD1LBQ+1hdid1Qmcmy0zWpf/e2aLgMXnv5X05Hlf7QnDZY18q4ZDEAnqdvoJhCOzGKsY0aMJZC+7oAUliFM3wGqMrtYMQEwJjdOLhENVuXjHCtm2akiT7J2xb0bN3CTvNXLFrowXJl7pYvPj8Oa3X95sWe82N6IrBVe4WT4XUPxVWJiYifHCMHeaF12Es2rteotgVegY9tvp/IXrRmb+5/F6LkhmzZmtP3DjfGss7V1udTj8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "FnRegexReplace", each FnRegexReplace([Column1], "((?:\S+\.(?:net|org|com)\b|\b[mdjs]rs?\.|\d*\.\d+|[a-z]\.(?:[a-z]\.)+|[^?.!])+(?:[.?!]+|$))[?!.\s]*)", "$1\n", "gi"))
in
    #"Invoked Custom Function"
Nick
  • 789
  • 5
  • 22
  • sadly the same blank result returns for me – Nick Jul 07 '22 at 11:34
  • Then my 2nd best guess is lookbehind is simply not supported (just as Excel's VBA). – JvdV Jul 07 '22 at 11:35
  • I just tested replacement from JvdV in regex101. Any idea why the result is different? before: https://regex101.com/r/WEC0M9/6 (V6) vs after: https://regex101.com/r/WEC0M9/9 (V9) – Anonymous Jul 07 '22 at 11:54
  • @Anonymous, my suggested pattern was just not great. The nested negative lookbehind where not mutuall excluding. Thanks for testing =) – JvdV Jul 07 '22 at 11:59
  • 2
    The `Web` environment uses the javascript regex flavor which does not support look-behind. You have also tagged `Power BI`. Power Query in Power BI can run Python and R scripts, both of which support not only look-behind but also the re.split method. – Ron Rosenfeld Jul 07 '22 at 12:09
  • @JvdV thank you for the explanation. Just keen on learning as a not regular regex user. – Anonymous Jul 07 '22 at 12:21
  • Anyhow, maybe OP can use `((?:\S+\.(?:net|org|com)\b|\b[mdjs]rs?\.|\d*\.\d+|[a-z]\.(?:[a-z]\.)+|[^?.!])+(?:[.?!]+|$))[?!.\s]*` instead. It probably would need some more work to catch more quirks. Replace with `$1\n` and case-insensitive flag. – JvdV Jul 07 '22 at 12:29
  • @JvdV So ive tested this out and still the same result. I have posted an update above if you want to take a look in Power Query – Nick Jul 07 '22 at 12:39
  • I was about to delete the whole entry from regex101.com. Now I just added the above as version 10. @JVdV the only other way to exclude and not using look back/ahead I found is something like this ```[^a-z]``` (I used it in version 8) but no idea whether is possible and how to apply this instead of the look back part. The look forward part I was able to replace. – Anonymous Jul 07 '22 at 12:56
  • @Nick, not sure how it worked exactly but I noticed that doing a replace on a whole match would work, an error occured if there are multiple submatches. A saver bet would be to use `R` or `Python` scripts as suggested by Ron. – JvdV Jul 07 '22 at 13:40
  • @JvdV would you mind posting and example of code where the whole match works? – Nick Jul 07 '22 at 14:16
  • Python and R may very well be the way forward for my own personal use. I am however trying to develop an excel based tool that others who don't have access to Power BI can use. – Nick Jul 07 '22 at 14:26
  • 1
    @Nick, I'm reading about possibly altering the Javascript based function. I just don't get around doing it today. Possibly tomorrow (hopefully someone else has the answer ready for you by then.) – JvdV Jul 07 '22 at 17:06
  • I too have considered this, but it’s a bit beyond me right now – Nick Jul 07 '22 at 22:44

2 Answers2

1

I think the following will be helpfull:

enter image description here

  • For demonstration purposes I loaded the data directly from Excel. I'm sure you can figure out how to connect your PDF;

  • Since the JavaScript-based function is a small HTML-script we have to escape the apostrope in the sample text first using a replace function. Otherwise it will clash with the apostrophes used to write the script in the function (see below). If we don't the function will error out/show nothing. Apostrophe will be shown correctly after applying function;

  • I edited the pattern to catch a full sentence in 1st capture group and for this sample I replaced what is captured with the backreference to this group and a pipe-symbol to visualize the result. Note there is no use of a negative lookbehind nomore since that is not supported in the engine. This resulted in a lengthy pattern which probably does not yet catch all the quirks possible:

    \s*((?:\b[MDJS]rs?\.|\d*\.\d+|\S+\.(?:com|net|org)\b|[a-z]\.(?:[a-z]\.)+|[^.?!])+(?:[.?!]+|$))
    

M-Code:

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Kol", type text}}),
    #"Replaced Value" = Table.ReplaceValue(#"Changed Type","'","&apos",Replacer.ReplaceText,{"Kol"}),
    #"Invoked Custom Function" = Table.AddColumn(#"Replaced Value", "fnRegexReplace", each fnRegexReplace([Kol], "\\s*((?:\\b[MDJS]rs?\\.|\\d*\\.\\d+|\\S+\\.(?:com|net|org)\\b|[a-z]\\.(?:[a-z]\\.)+|[^.?!])+(?:[.?!]+|$))", "$1|"))
in
    #"Invoked Custom Function"

Used function fnRegexReplace:

(x,y,z)=>
let 
   Source = Web.Page(
                     "<script>var x="&"'"&x&"'"&";var z="&"'"&z&
                     "'"&";var y=new RegExp('"&y&"','gmi');
                     var b=x.replace(y,z);document.write(b);</script>")
                     [Data]{0}[Children]{0}[Children]{1}[Text]{0}
in 
   Source

An online demo of the regular expression.

JvdV
  • 70,606
  • 8
  • 39
  • 70
  • Hi there, fine about loading straight from excel as per this question ( though you're right it is ultimately for PDFs), However, where is the Kol Column coming from? – Nick Jul 08 '22 at 10:01
  • 1
    Apologies, confusion that Kol was the original text column. Its working now and looks great. Will test with other texts to explore how dynamic this is – Nick Jul 08 '22 at 10:09
  • Tested it out on some large pieces of text and it works really well. the only issue I have found is this text https://www.varsitytutors.com/praxis_reading-help/long-passages-200-400-words. the ed. in (1902, ed. David Starr Jordan) is split. But tbh i can't see how this could be logically separated from say the end of a sentence like: He was `in.` Appears to handle everything else brilliantly though. – Nick Jul 08 '22 at 10:53
  • @Nick, if you are able to read regex well enough there is also the option to add to the pattern as you go and find these quirks. I assume you can also have abbreviations like `St.` for example that are currently un-accounted for. There are also probably a ton of different top-level-domains (TLD's) we currently didn't account for, etc... – JvdV Jul 08 '22 at 11:05
  • 1
    @Nick, *"When a string is double quoted, it is processed by the compiler and again at run-time. Since a backslash (\) is removed whenever the string is processed, the double-quoted string needs double backslashes so that there is one left in the string at run time to escape a "special character". "* From [here](https://www.ibm.com/support/pages/when-use-double-backslash-escape-special-character-regular-expression). Hope it helps. Certainly not all compilers need this AFAIK. – JvdV Jul 08 '22 at 11:29
  • 1
    Adding `y= Text.Replace(y,"\","\\"),` in your function solves this issue so you can use the original \. I'm just surprised other answers don't do this. Unless there is an issue doing this? – Nick Jul 08 '22 at 11:32
  • 1
    Well done in finding a regex avoiding the negative lookbehind ! – Anonymous Jul 08 '22 at 22:40
0

This regex works for most texts from the start but can accommodate for issues that may arise.

\s*((?:\b(?:[djms]rs?|flam|liq|st)\.|\b(?:[a-z]\.){2,}|\.\d|\.(?:com|net|org)\b|[^.?!])+(?:[.?!]+|$)) (Gmi as flags)

Where flam|liq|st are examples where a split would normally occur if the word is followed by a . e.g. for an abbreviation. This section of the regex forced these to be ignored. e.g. If you had the text, St. Bernards typically weigh 80kg. This would usually split on the St. However adding st to this region of the regex ignores this so the sentence is captured as a whole. You can keep adding to this section to try and accommodate for most errors. If you come up with anyway of improving on this further do please post a comment/answer.

Nick
  • 789
  • 5
  • 22