I am attempting to split an example paragraph into sentences using regex in Power Query:
Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Dr. Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.However, this line wont do it. Qr. Test for Website.COM and Labs.ORG looks good.Creatively not working. t and finished. 9 to start
Into:
Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind? Dr. Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
However, this line wont do it.
Qr.
Test for Website.
COM and Labs.
ORG looks good.
Creatively not working. t and finished.
9 to start
Here is a function that enables PQ to utilise regex replace:
FnRegexReplace
// regexReplace
let regexReplace=(text as nullable text,pattern as nullable text,replace as nullable text, optional flags as nullable text) as text =>
let
f=if flags = null or flags ="" then "" else flags,
l1 = List.Transform({text, pattern, replace}, each Text.Replace(_, "\", "\\")),
l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
t = Text.Format("<script>var txt='#{0}';document.write(txt.replace(new RegExp('#{1}','#{3}'),'#{2}'));</script>", List.Combine({l2,{f}})),
r=Web.Page(t)[Data]{0}[Children]{0}[Children],
Output=if List.Count(r)>1 then r{1}[Text]{0} else ""
in Output
in regexReplace
I also have the following regex provided prom a previous post which appears to work on Regex101.
https://regex101.com/r/WEC0M9/6
Pattern: (?<!Mr|Mrs|Dr|Jr)(\.+)(\s+(?![a-z])|(?=[A-Z]))
Replace: $1\r\n
(I think this can be anything like *
)
flags: gm
The issue I have is that when I attempt this is Power Query I am returned with no result:
Alternatively (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
can be found here but the same issue occurs.
The issue appears to lie with the look-backwards and look-forward respectively ?
as the function at least returns a result when this is removed. If anyone can advice on how to best get this paragraph to split using regex as shwon above in PQ that would be great.
M Code:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("TY9BSwNBDIX/yqPnJVRBiicRetBCD1LBQ+1hdid1Qmcmy0zWpf/e2aLgMXnv5X05Hlf7QnDZY18q4ZDEAnqdvoJhCOzGKsY0aMJZC+7oAUliFM3wGqMrtYMQEwJjdOLhENVuXjHCtm2akiT7J2xb0bN3CTvNXLFrowXJl7pYvPj8Oa3X95sWe82N6IrBVe4WT4XUPxVWJiYifHCMHeaF12Es2rteotgVegY9tvp/IXrRmb+5/F6LkhmzZmtP3DjfGss7V1udTj8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
#"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "FnRegexReplace", each FnRegexReplace([Column1], "(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s", "$1\r\n", "gm"))
in
#"Invoked Custom Function"
Update1: M Code with proposed Regex:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("TY9BSwNBDIX/yqPnJVRBiicRetBCD1LBQ+1hdid1Qmcmy0zWpf/e2aLgMXnv5X05Hlf7QnDZY18q4ZDEAnqdvoJhCOzGKsY0aMJZC+7oAUliFM3wGqMrtYMQEwJjdOLhENVuXjHCtm2akiT7J2xb0bN3CTvNXLFrowXJl7pYvPj8Oa3X95sWe82N6IrBVe4WT4XUPxVWJiYifHCMHeaF12Es2rteotgVegY9tvp/IXrRmb+5/F6LkhmzZmtP3DjfGss7V1udTj8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
#"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "FnRegexReplace", each FnRegexReplace([Column1], "((?:\S+\.(?:net|org|com)\b|\b[mdjs]rs?\.|\d*\.\d+|[a-z]\.(?:[a-z]\.)+|[^?.!])+(?:[.?!]+|$))[?!.\s]*)", "$1\n", "gi"))
in
#"Invoked Custom Function"