A Better way to Extract Sentences from PDFs using Power Query/BI

Question

I have been exploring the idea of using Power BI to dynamically try to extract information from any PDF if given a URL.

In this case the URL is https://hpvchemicals.oecd.org/ui/handler.axd?id=fae8d1b1-406b-4287-8a05-f81aa1b16d3f which is a safety assessment profile for formaldehyde. I wish to be able to extract as many sentences from a PDF document as possible.

Assuming ". " identifies the end of a sentence this actually works really well where paragraphs are concerned, along with some other trickery, to split the entire PDFinto sentences which I can then search.

M Code (updated for improved Extraction):

    let
    
    Source = Pdf.Tables(Web.Contents("https://hpvchemicals.oecd.org/ui/handler.axd?id=fae8d1b1-406b-4287-8a05-f81aa1b16d3f"), [Implementation="1.3"]),
    #"Expanded Data" = Table.ExpandTableColumn(Source, "Data", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29", "Column30", "Column31", "Column32", "Column33", "Column34", "Column35", "Column36", "Column37", "Column38", "Column39", "Column40", "Column41", "Column42"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29", "Column30", "Column31", "Column32", "Column33", "Column34", "Column35", "Column36", "Column37", "Column38", "Column39", "Column40", "Column41", "Column42"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Data", each ([Kind] = "Page")),
    #"Removed Columns1" = Table.RemoveColumns(#"Filtered Rows",{"Name", "Kind"}),
    #"Added Index" = Table.AddIndexColumn(#"Removed Columns1", "Index", 0, 1, Int64.Type),
    Exclude={"Id"},
    List=List.Difference(Table.ColumnNames(#"Removed Columns1"),Exclude),
    MergeAllColumns= Table.AddColumn(#"Added Index","Custom", each Text.Combine(Record.ToList( Table.SelectColumns(#"Added Index",List){[Index]}), "  ")),
    #"Removed Other Columns" = Table.SelectColumns(MergeAllColumns,{"Id", "Custom"}),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Other Columns", {{"Custom", Splitter.SplitTextByDelimiter("#(lf)", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
    #"Replaced Value" = Table.ReplaceValue(#"Split Column by Delimiter",". ",".. #",Replacer.ReplaceText,{"Custom"}),
    #"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Replaced Value", {{"Custom", Splitter.SplitTextByDelimiter(". ", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
    #"Cleaned Text" = Table.TransformColumns(#"Split Column by Delimiter1",{{"Custom", Text.Clean, type text}}),
    #"Added Index1" = Table.AddIndexColumn(#"Cleaned Text", "Index", 0, 1, Int64.Type),
    #"Added Custom" = Table.AddColumn(#"Added Index1", "Custom.1", each if Text.Contains([Custom], "#") then [Index] else null),
    #"Filled Down" = Table.FillDown(#"Added Custom",{"Custom.1"}),
    #"Grouped Rows" = Table.Group(#"Filled Down", {"Custom.1"}, {{"Custom", each Text.Combine([Custom], " "), type text}}),
    #"Split Column by Delimiter2" = Table.ExpandListColumn(Table.TransformColumns(#"Grouped Rows", {{"Custom", Splitter.SplitTextByDelimiter(". ", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
    #"Added Custom1" = Table.AddColumn(#"Split Column by Delimiter2", "Custom.2", each if Text.Contains([Custom], "NOAEL") then [Custom] else null)
in
    #"Added Custom1"

So from this:

to this:

Although this sentence splitting works pretty well using ". " I'm now being greedy and wondering if this can be done in an even better way.

There are some instances where the Sentence doesn't split correctly which could be improved. For example, if the end of a sentence is joined to the next e.g. Hello.How are you would not split. Whilst ...e.g. Apple itself is recognised would split into. ...e.g. and Apple.

Python - RegEx for splitting text into sentences (sentence-tokenizing)

Appears to propose doing this using regex however with this excel regex replace function this doesn't appear to work.

fnRegexExtr3 (doesn't require \\ just \):

    // regexReplace
let   regexReplace=(text as nullable text,pattern as nullable text,replace as nullable text, optional flags as nullable text) as text =>
    let
        f=if flags = null or flags ="" then "" else flags,
        l1 = List.Transform({text, pattern, replace}, each Text.Replace(_, "\", "\\")),
        l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
        t = Text.Format("<script>var txt='#{0}';document.write(txt.replace(new RegExp('#{1}','#{3}'),'#{2}'));</script>", List.Combine({l2,{f}})),
        r=Web.Page(t)[Data]{0}[Children]{0}[Children],
        Output=if List.Count(r)>1 then r{1}[Text]{0} else ""
    in Output
in regexReplace

please feel free to look at the PDF link and propose any suggestions for improving the capturing of data.

I will continue to update here with any progress.

There are different flavors of RegEx incl. some syntax and feature differences. If I remember correctly MS follows the .NET flavor. You may want to check. — Anonymous, Jul 05 '22 at 11:24
I don't think you need RegEx. What are you trying to achieve exactly? Can you not split by line breaks to get a better output? — Davide Bacci, Jul 05 '22 at 11:26
@David The issue here is that excel interprets PDFs in a strange way and so all columns need to be merged to get all the sentences together as a first step but it's messy. Splitting by line break causes sentences which arent already together to split further. This can be seem on Query Setting `Removed Other Columns` Row 7. So the idea with regex to see if there was a better way to Identify the full stops which belong to sentences, replace these with a unique character (@) and then split on this. — Nick, Jul 05 '22 at 11:29
@David https://regex101.com/r/nG1gU7/27 Appears to try and do this but doesn't seem to agree with this function — Nick, Jul 05 '22 at 11:31
Is the desired output to just have one row per logical sentence. i.e. each sentence or title (without full stops) on a single row? — Davide Bacci, Jul 05 '22 at 11:32
@David yep pretty much. Anything that improves on this. Regex as you say may not be necessary, was just an idea which someone else looked at and I thought might work — Nick, Jul 05 '22 at 11:33
Notably it doesn't have to be perfect. I dont even think it can be perfect, just an improvement for identifying sentences. — Nick, Jul 05 '22 at 11:35
I think this one will be a struggle as you can't tell proper line breaks using carriage returns/line feeds. I doubt RegEx will be able to solve this either and it is down to the original formatting of the PDF. — Davide Bacci, Jul 05 '22 at 11:49
@David if we forget headings and just focus on paragraphs I think that makes it more feasible? — Nick, Jul 05 '22 at 11:51
I think you still have the same issue which is no way to differentiate between a heading and a sentence. Is there no other source for the info? — Davide Bacci, Jul 05 '22 at 11:56
@David No other source for this, but im also just generally interested in extracting data from any PDF. applying this method to random PDFs also appears to extract sentences pretty well. Its just those occasional Typos where there is a space missing that causes 2 sentences to be joined together. As I say im pretty happy with the result so far, I just want to see if I can improve the sentence detection to handle the occasional typo. — Nick, Jul 05 '22 at 12:01
If the headings are kind of standardized they can be put in use too, to further regex extract the sections, e.g. “RECOMMENDATIONS”, “SUMMARY CONCLUSIONS” as key words … — Anonymous, Jul 05 '22 at 13:12
@Anonymous Looking into another apporach using the headers. Will also look into your ideas — Nick, Jul 05 '22 at 13:13
@smpa01 identify Logical sentences and replace `.` that the end of that sentence with a unique character i.e. `#` It should ignore `.` if within things like `3.5` or `e.g. 10kg `. Perhaps the most tricky is a typo where there is no break after the `.` like `Hello.How are you`. I hope that helps. — Nick, Jul 05 '22 at 19:26
@Nick I just had a look to you linked regex. You may want to try ```\r\n``` for Windows environment. As for cases like ```Hello.How are you``` may just run against a 2nd regex after your previous one. — Anonymous, Jul 06 '22 at 04:55
You may have found this already: https://stackoverflow.com/questions/21431011/net-regex-for-full-stop-at-the-end-of-a-sentence — Anonymous, Jul 06 '22 at 04:56
@Anonymous Hi there would you mind posting exactly the regex pattern, replace, and flags (is possible) you are referring to? — Nick, Jul 06 '22 at 12:40
Currently i have `(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s` as the pattern with `\r\n` as replace. but with the above function this doesnt return anything — Nick, Jul 06 '22 at 12:44
@Nick Do you mean the pattern works at regex101 but not when you use it against your pdf text on your “local machine”? Does it work when you put some paragraphs of the pdf text in regex101? And vice versa? — Anonymous, Jul 06 '22 at 22:19
I had a closer look to your regex and understand better the “Hello.How …” issue due to the ```\w.\w.``` to exclude websites. I am not an regex expert. For me, it’ works better to use a sequence of simple regex instead of complex and nested. — Anonymous, Jul 06 '22 at 23:11
As an example: it assumes a new sentence starts with a character in Capital regardless the space in between and as an alternative approach to your regex it defines the exclusion explicitly such as Mr.|Mrs. etc. instead of [A-Z][a-z]\. … https://regex101.com/r/WEC0M9/1 … open issue: it won’t consider any numbers as a starting of a new sentence and if a website has any capital character after the dot like .COM it will break. To adjust this, need to modify the look-ahead at the end. — Anonymous, Jul 07 '22 at 00:22
Update: I tried with a few alternatives. My preference is version 6 (https://regex101.com/r/WEC0M9/6). If .COM is an issue, a faster solution for me would be to run another regex to change to .com before this one (or you may want to check version 7 (https://regex101.com/r/WEC0M9/7) or check with some regex experts ;-) … it seems also that the .NET regex engine is more picky about look-back and look-ahead expressions, could this be the reason in your case too? On regex101 this .NET engine has some problems while the php engine runs rather smoothly — Anonymous, Jul 07 '22 at 04:48
@Anonymous Im unsure if you are just using regex101 or also using PQ editor. The issue I currently face is all these regex are causing errors with the function and returning no result — Nick, Jul 07 '22 at 09:41
These days I have no access to PQ. Only iPad, so yes, just regex101. Actually, I also tested here with .NET and it works: http://regexstorm.net/tester (was provided as link from MS site) Maybe you could try the expression v1 first or even without the look-around part to check the engine on hand. — Anonymous, Jul 07 '22 at 09:49
The ? as in the regex string? Yes, this part is the look-backward and look-forward respectively. According to MS .NET it’s supported but I saw in some posts that it does not do well if another complex expression inside. (For my own use case there was no need for these look-around parts) — Anonymous, Jul 07 '22 at 10:06
Another idea for work around: replace those Mr. and i.e. first (with regex) then add the line breaks without need to look-backward and then revert those replacements. V8 works without look-forward at the end (limitation: breaks .COM, but this can also be replaced with .com before) — Anonymous, Jul 07 '22 at 10:23

A Better way to Extract Sentences from PDFs using Power Query/BI

0 Answers0

Linked