I have been exploring the idea of using Power BI to dynamically try to extract information from any PDF if given a URL.
In this case the URL is https://hpvchemicals.oecd.org/ui/handler.axd?id=fae8d1b1-406b-4287-8a05-f81aa1b16d3f which is a safety assessment profile for formaldehyde. I wish to be able to extract as many sentences from a PDF document as possible.
Assuming ". " identifies the end of a sentence this actually works really well where paragraphs are concerned, along with some other trickery, to split the entire PDFinto sentences which I can then search.
M Code (updated for improved Extraction):
let
Source = Pdf.Tables(Web.Contents("https://hpvchemicals.oecd.org/ui/handler.axd?id=fae8d1b1-406b-4287-8a05-f81aa1b16d3f"), [Implementation="1.3"]),
#"Expanded Data" = Table.ExpandTableColumn(Source, "Data", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29", "Column30", "Column31", "Column32", "Column33", "Column34", "Column35", "Column36", "Column37", "Column38", "Column39", "Column40", "Column41", "Column42"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29", "Column30", "Column31", "Column32", "Column33", "Column34", "Column35", "Column36", "Column37", "Column38", "Column39", "Column40", "Column41", "Column42"}),
#"Filtered Rows" = Table.SelectRows(#"Expanded Data", each ([Kind] = "Page")),
#"Removed Columns1" = Table.RemoveColumns(#"Filtered Rows",{"Name", "Kind"}),
#"Added Index" = Table.AddIndexColumn(#"Removed Columns1", "Index", 0, 1, Int64.Type),
Exclude={"Id"},
List=List.Difference(Table.ColumnNames(#"Removed Columns1"),Exclude),
MergeAllColumns= Table.AddColumn(#"Added Index","Custom", each Text.Combine(Record.ToList( Table.SelectColumns(#"Added Index",List){[Index]}), " ")),
#"Removed Other Columns" = Table.SelectColumns(MergeAllColumns,{"Id", "Custom"}),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Other Columns", {{"Custom", Splitter.SplitTextByDelimiter("#(lf)", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
#"Replaced Value" = Table.ReplaceValue(#"Split Column by Delimiter",". ",".. #",Replacer.ReplaceText,{"Custom"}),
#"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Replaced Value", {{"Custom", Splitter.SplitTextByDelimiter(". ", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
#"Cleaned Text" = Table.TransformColumns(#"Split Column by Delimiter1",{{"Custom", Text.Clean, type text}}),
#"Added Index1" = Table.AddIndexColumn(#"Cleaned Text", "Index", 0, 1, Int64.Type),
#"Added Custom" = Table.AddColumn(#"Added Index1", "Custom.1", each if Text.Contains([Custom], "#") then [Index] else null),
#"Filled Down" = Table.FillDown(#"Added Custom",{"Custom.1"}),
#"Grouped Rows" = Table.Group(#"Filled Down", {"Custom.1"}, {{"Custom", each Text.Combine([Custom], " "), type text}}),
#"Split Column by Delimiter2" = Table.ExpandListColumn(Table.TransformColumns(#"Grouped Rows", {{"Custom", Splitter.SplitTextByDelimiter(". ", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
#"Added Custom1" = Table.AddColumn(#"Split Column by Delimiter2", "Custom.2", each if Text.Contains([Custom], "NOAEL") then [Custom] else null)
in
#"Added Custom1"
So from this:
to this:
Although this sentence splitting works pretty well using ". " I'm now being greedy and wondering if this can be done in an even better way.
There are some instances where the Sentence doesn't split correctly which could be improved. For example, if the end of a sentence is joined to the next e.g. Hello.How are you
would not split. Whilst ...e.g. Apple
itself is recognised would split into. ...e.g.
and Apple
.
Python - RegEx for splitting text into sentences (sentence-tokenizing)
Appears to propose doing this using regex however with this excel regex replace function this doesn't appear to work.
fnRegexExtr3 (doesn't require \\
just \
):
// regexReplace
let regexReplace=(text as nullable text,pattern as nullable text,replace as nullable text, optional flags as nullable text) as text =>
let
f=if flags = null or flags ="" then "" else flags,
l1 = List.Transform({text, pattern, replace}, each Text.Replace(_, "\", "\\")),
l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
t = Text.Format("<script>var txt='#{0}';document.write(txt.replace(new RegExp('#{1}','#{3}'),'#{2}'));</script>", List.Combine({l2,{f}})),
r=Web.Page(t)[Data]{0}[Children]{0}[Children],
Output=if List.Count(r)>1 then r{1}[Text]{0} else ""
in Output
in regexReplace
please feel free to look at the PDF link and propose any suggestions for improving the capturing of data.
I will continue to update here with any progress.