1

I wish to separate the following data using regex function as follows:

enter image description here

Function to be used:

let   fx=(text,regex)=>
    Web.Page(
        "<script>
            var x='"&text&"';
            var y=new RegExp('"&regex&"','g');
            
            var b=x.match(y);
            document.write(b);
        </script>")[Data]{0}[Children]{0}[Children]{1}[Text]{0}

in
fx

Agurments:

  • text - Column1
  • List item

regex - \\d+\\.?\\d+

This successfully extracts the numerical values, however:

  1. I am unsure if this is the correct regex to remove the first integer/number.
  2. I am unsure how to use regex to extract only the units. It seems to run into errors despite various attempts. e.g. \D+ doesn't return the non-numerical values despite it working on the link. This being said for 15 ng/m3 if this did work it would only return ng/m3. I wonder if there is an issue with the function itself.

M Code:

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Invoked Custom Function1" = Table.AddColumn(#"Changed Type", "fnRegexExtr2", each fnRegexExtr2([Column1], "\\d+\\.?\\d+")),
    #"Invoked Custom Function" = Table.AddColumn(#"Invoked Custom Function1", "fnRegexExtr2.1", each fnRegexExtr2([fnRegexExtr2], "\\D+"))
in
    #"Invoked Custom Function"

update with "^[^\s]+":

enter image description here

data:

1200 mg/kg bw/day
24 mg/kg/day
0.79 mg/kg bw/day
15 ng/m3
15 ng/m 3
Not Limited
 30mg/m³
Nick
  • 789
  • 5
  • 22
  • 1
    Select Column1 > 'Start' tab > 'Split Column' on leftmost space. And if numbers are not at the start, then please provide sample data with expected results. – JvdV May 12 '22 at 13:02
  • 1
    @JvDv has the way = Table.SplitColumn(#"PriorStep", "Column1", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"Column1.1", "Column1.2"}) – horseyride May 12 '22 at 13:05
  • Also, you are wondering about your pattern, but that basically states you would like to find any two digits (or more) anywhere in the text. If you must, then at least try: `^\\d+(?:\\.\\d+)?` to assert the digits are at the beginning of the string. – JvdV May 12 '22 at 13:17
  • I appreciate it can be done using split with this example but I am trying to understand how to achieve this using regex. – Nick May 12 '22 at 13:17
  • 1
    @JvdV updated image and provided example data. – Nick May 12 '22 at 13:26
  • have you played with https://regex101.com/ – horseyride May 12 '22 at 13:47
  • @horseyride I have been but this is as far as I have gotten. As above I have extracted the units, and this seems okay but not sure if the best way to achieve this. Now its the units I am having trouble with. – Nick May 12 '22 at 13:51

2 Answers2

2

There is no issue with the function itself, the problem lays with the patterns used:


Numeric part: You currently use \\d+\\.?\\d+ which basically means; "Any 1+ digit followed by an optional dot and at least another 1+ digits". So, at least two digits anywhere in a string. The appropriate regex would be:

^\\d+(?:\\.\\d+)?

Meaning:

  • ^ - Start-line anchor;
  • \\d+ - 1+ (Greedy) digits;
  • (?:\\.\\d+)? - Optional non-capture group to match a literal dot followed by at least 1+ more digits.

Units: Your pattern [\D+][2] matches a single character from the 1st class that is either a non-digit or a literal plus sign. Your 2nd character class matches a literal 2. So you are looking for patterns like 'A2' or '+2' etc. The appropriate regex would be dependend on your input.


Proposal:

In a previous answer I already suggested a different JS-based function, to replace data rather than to match data. Therefor add:

(x,y,z)=>
let 
   Source = Web.Page(
                     "<script>var x="&"'"&x&"'"&";var z="&"'"&z&
                     "'"&";var y=new RegExp('"&y&"','g');
                     var b=x.replace(y,z);document.write(b);</script>")
                     [Data]{0}[Children]{0}[Children]{1}[Text]{0}
in 
   Source

Now for both columns try pattern:

^(\\d+(?:\\.\\d+)?)?\\s*(.+)$

Replace the values with $1 for the numeric part, and with $2 for the leftover unit.

enter image description here

let
    Source = Excel.CurrentWorkbook(){[Name="Tabel1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Invoked Custom Function1" = Table.AddColumn(#"Changed Type", "Nr", each fnRegexExtr([Column1], "^(\\d+(?:\\.\\d+)?)?\\s*(.+)$", "$1")),
    #"Invoked Custom Function2" = Table.AddColumn(#"Invoked Custom Function1", "Unit", each fnRegexExtr([Column1], "^(\\d+(?:\\.\\d+)?)?\\s*(.+)$", "$2")),
    #"Replaced Errors" = Table.ReplaceErrorValues(#"Invoked Custom Function2", {{"Nr", null}, {"Unit", null}})
in
    #"Replaced Errors"

A 2nd option would be to replace the value with a delimiter where you later split on:

let
    Source = Excel.CurrentWorkbook(){[Name="Tabel1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "fnRegexExtr", each fnRegexExtr([Column1], "^(\\d+(?:\\.\\d+)?)?\\s*(.+)$", "$1|$2")),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Invoked Custom Function", "fnRegexExtr", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), {"Nr", "Unit"})
in
    #"Split Column by Delimiter"
JvdV
  • 70,606
  • 8
  • 39
  • 70
  • 1
    Ah okay, Thanks for this. Apologies too [\D+][2] was a typo when trying to add a hyperlink and is nothing to do with my regex. – Nick May 12 '22 at 13:57
  • Just one nigglign example that esacpes this logic. If I wanted to capture `1 295 mg/kg bw/day` with the accidental space. How could I do this? – Nick May 12 '22 at 15:58
  • @Nick, try to use `^(\\d+(?:[ .]\\d+)?)?\\s*(.+)$` instead then – JvdV May 12 '22 at 16:16
0

Thanks for posting. Looks like you need to match characters up the first space. For that you can use the following regex expression.

^[^\s]+
Bumbl3b33
  • 5
  • 1
  • 4
  • Hi thanks for getting back to me. I've update the image to include a few extra scenarios. with `^[^\\s]+` (excel requires \\) this works okay but 0.79 mg/kg dw/day is returned as 0.79 mg/kg. Also with the updates image this returns Not Limited --> Not and 30mg/m3 --> null – Nick May 12 '22 at 13:29
  • I have included an update showing the result of your formula. Hope this is useful. – Nick May 12 '22 at 13:30