1

Pulling data into excel from a text file of PDF occasionally splits sentences.

Following a previous post, Using PQ I can group text together and then with regex identify . belonging to the end of a sentence and split using this. This works really well, except where a Subheading may exist without punctuation. This results in a minor error where sentences are occasionally joined together. for example

SUBHEADING 1
Non-sensical
sentence 1. Sentence 2.

Returns as:

SUBHEADING 1 Non-sensical sentence 1.
Sentence 2.

I initially thought this was the best that could be done, however, the focus so far has been on identifying the end of a sentence with . Additionally though, we could also use the fact that sentences tend to start with a capital letter to refine this further. The way excel interprets PDF pages means that prior transformations are required to group the text, which I am pretty certain rules out using regex.

Here is an example of what I am trying to do (orange) and where I currently am (green):

enter image description here

Notice how each sentence starts with a capital and ends with .. I think this will be a logical way to separate SUBHEADING/CHAPTERS, etc., from the rest of the text. From this, I can then apply the remainder of the transformations on the previous post and if im right, this should give better separation.

So far:

I have identified which sentences Start with a Captial letter/lowercase and which end in lowercase or . (where ending in a number is also considered lowercase). Using this, I think it's safe to assume that:

  1. If a row ends lowercase AND the next row also contains lowercase, this belongs to the same sentence and can be appended.

  2. If the Sentence starts Uppercase and ends in a full stop, this is a complete sentence and requires no changes.

  3. If a Sentence starts uppercase without a .At the end and then Next is uppercase; this is a subheading/Extra info, not a typical sentence.

I think it is possible to generate the Desired table using these rules. I am having trouble determining the transformations required to append the sentences following the above rules. I need some sort of way to identify the Upper/lower/. of rows above and below and I just cant see how.

I will update as I progress; however, if anyone could comment on how to achieve this using these rules, that would be great.

Note there is an error with the current M code where 3d. Is recognised as uppercase.

M Code:

let
    Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
            #"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.Start([Column1], 1) = Text.Lower(Text.Start([Column1], 1)) then "Lowercase" else "Uppercase"),
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each if Text.End([Column1], 1) = "." then Text.End([Column1], 1) else if Text.End([Column1], 1) = Text.Lower(Text.End([Column1], 1)) then "Lowercase" else null)
in
    #"Added Custom1"

Source Data:

SUBHEADING 1
Non-sensical
sentence 1. Sentence 2.
Sentence 3 part 3a,
part 3b, and
part 3c.
SUBHEADING 1.1.
Non-sensical
sentence 4. Sentence 5.
Sentence 6 part 3a,
part 3b, 3c and
3d.
2.0. SUBHEADING
Extra Info 1 (Not a proper sentence)
Extra Info 2
Sentence 7.
SUBHEADING 3.0
Sentence 8.

SOLUTION SO FAR (Minor error if Subheading starts with number)

let
    Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
            #"Added Custom" = Table.AddColumn(Source, "Start", each if Text.Start([Column1], 1) = Text.Lower(Text.Start([Column1], 1)) then "Lowercase" else "Uppercase"),
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "End", each if Text.End([Column1], 1) = "." then Text.End([Column1], 1) else if Text.End([Column1], 1) = Text.Lower(Text.End([Column1], 1)) then "Lowercase" else null),
    #"Added Custom4" = Table.AddColumn(#"Added Custom1", "Complete Sentences", each if [Start] = "Uppercase" and [End] = "." then "COMPLETE" else null),
    #"Added Index" = Table.AddIndexColumn(#"Added Custom4", "Index", 0, 1, Int64.Type),
    #"Added Custom2" = Table.AddColumn(#"Added Index", "StartCheckCaseBelow", each try #"Added Index" [Start] { [Index]  + 1 } otherwise "COMPLETE"),
    #"Removed Columns1" = Table.RemoveColumns(#"Added Custom2",{"Index"}),
    #"Added Custom3" = Table.AddColumn(#"Removed Columns1", "CompleteHeadings_CompleteText", each if [StartCheckCaseBelow]= "COMPLETE" then "COMPLETE" else if [Start] = "Uppercase" and [StartCheckCaseBelow] = "Uppercase" then "COMPLETE" else null),
    #"Added Custom5" = Table.AddColumn(#"Added Custom3", "Custom", each if [CompleteHeadings_CompleteText] = null then if [Start] = "Uppercase" and [End] = "Lowercase" then "START" else if [End] = "." then "END" else null else null),
    #"Added Index1" = Table.AddIndexColumn(#"Added Custom5", "Index", 0, 1, Int64.Type),
    #"Merged Columns" = Table.CombineColumns(Table.TransformColumnTypes(#"Added Index1", {{"Index", type text}}, "en-GB"),{"Index", "CompleteHeadings_CompleteText"},Combiner.CombineTextByDelimiter(" ", QuoteStyle.None),"CompleteHeadings_CompleteText"),
    #"Added Index3" = Table.AddIndexColumn(#"Merged Columns", "Index", 0, 1, Int64.Type),
    #"Filtered Rows" = Table.SelectRows(#"Added Index3", each ([Custom] = "START")),
    #"Added Index2" = Table.AddIndexColumn(#"Filtered Rows", "temp", 0, 1, Int64.Type),
    combined = #"Added Index2" & Table.SelectRows(#"Added Index3", each [Custom] <> "START"),
    #"Sorted Rows" = Table.Sort(combined,{{"Index", Order.Ascending}}),
    #"Removed Columns" = Table.RemoveColumns(#"Sorted Rows",{"Index"}),
    #"Added Custom6" = Table.AddColumn(#"Removed Columns", "Custom.1", each if Text.Contains([CompleteHeadings_CompleteText], "COMPLETE") then [CompleteHeadings_CompleteText] else if [Complete Sentences] <> null then [Complete Sentences] else [temp]),
    #"Filled Down" = Table.FillDown(#"Added Custom6",{"Custom.1"}),
    #"Removed Other Columns" = Table.SelectColumns(#"Filled Down",{"Column1", "Custom.1"}),
    #"Grouped Rows" = Table.Group(#"Removed Other Columns", {"Custom.1"}, {{"Text", each Text.Combine([Column1], " "), type text}}),
    #"Removed Other Columns1" = Table.SelectColumns(#"Grouped Rows",{"Text"})
in
    #"Removed Other Columns1"

This works to achieve the table in orange however requires many steps which need to be simplified. If you could advise on this or know of a better way please let me know.

Nick
  • 789
  • 5
  • 22

1 Answers1

1

enter image description here

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
            #"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.Start([Column1], 1) = Text.Lower(Text.Start([Column1], 1)) then " " else "| "),
    #"Merged Columns" = Table.CombineColumns(#"Added Custom",{ "Custom", "Column1"},Combiner.CombineTextByDelimiter("", QuoteStyle.None),"Merged"),
    #"Transposed Table" = Table.Transpose(#"Merged Columns"),
    Custom1 = Table.ColumnNames( #"Transposed Table"),
    Custom2 = #"Transposed Table",
    #"Merged Columns1" = Table.CombineColumns(Custom2,Custom1,Combiner.CombineTextByDelimiter("", QuoteStyle.None),"Merged"),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Merged Columns1", {{"Merged", Splitter.SplitTextByDelimiter("| ", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Merged"),
    #"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Split Column by Delimiter", {{"Merged", Splitter.SplitTextByDelimiter(". ", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Merged"),
    #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"Merged", type text}})
in
    #"Changed Type"

Will be interested to see if there are more elegant solutions out there.

Davide Bacci
  • 16,647
  • 3
  • 10
  • 36
  • This is near perfect. However, there is just one issue which Im not sure if it can be resolved. Say you had 2.0. SUBHEADING, and Sentence 3.5. this would be split. Very workable regardless but will just explore if this can also be captured in some way. – Nick Jul 25 '22 at 21:45
  • 1
    Not sure. I just used the test data you provided and expected output. – Davide Bacci Jul 25 '22 at 21:51
  • Instead of using Text.Lower, you can check if the first char is in a list {"a".."z",0..9 } and add any other special chars to your list too. I'm not sure what your logic is though as I just used the test data and it worked for that. – Davide Bacci Jul 26 '22 at 06:10
  • I will update my question. Apologies because it is frustrating to solve it only for the question to change. I will accept your answer regardless if unable to find an answer. It may be that this change makes the logic incomplete. – Nick Jul 26 '22 at 08:02
  • Okay so after thinking about it I think there is an issue with you answer in respect to the question in that the logic only pays attention to whether or not the first character is Upper or lowercase. Really I think the question is to try and combine the logic of both the start and end of the sentences. – Nick Jul 26 '22 at 08:58