Pulling data into excel from a text file of PDF occasionally splits sentences.
Following a previous post, Using PQ I can group text together and then with regex identify .
belonging to the end of a sentence and split using this. This works really well, except where a Subheading may exist without punctuation. This results in a minor error where sentences are occasionally joined together. for example
SUBHEADING 1
Non-sensical
sentence 1. Sentence 2.
Returns as:
SUBHEADING 1 Non-sensical sentence 1.
Sentence 2.
I initially thought this was the best that could be done, however, the focus so far has been on identifying the end of a sentence with .
Additionally though, we could also use the fact that sentences tend to start with a capital letter to refine this further. The way excel interprets PDF pages means that prior transformations are required to group the text, which I am pretty certain rules out using regex.
Here is an example of what I am trying to do (orange) and where I currently am (green):
Notice how each sentence starts with a capital and ends with .
. I think this will be a logical way to separate SUBHEADING/CHAPTERS, etc., from the rest of the text. From this, I can then apply the remainder of the transformations on the previous post and if im right, this should give better separation.
So far:
I have identified which sentences Start with a Captial letter/lowercase and which end in lowercase or .
(where ending in a number is also considered lowercase). Using this, I think it's safe to assume that:
If a row ends lowercase AND the next row also contains lowercase, this belongs to the same sentence and can be appended.
If the Sentence starts Uppercase and ends in a full stop, this is a complete sentence and requires no changes.
If a Sentence starts uppercase without a
.
At the end and then Next is uppercase; this is a subheading/Extra info, not a typical sentence.
I think it is possible to generate the Desired table using these rules. I am having trouble determining the transformations required to append the sentences following the above rules. I need some sort of way to identify the Upper/lower/. of rows above and below and I just cant see how.
I will update as I progress; however, if anyone could comment on how to achieve this using these rules, that would be great.
Note there is an error with the current M code where 3d. Is recognised as uppercase.
M Code:
let
Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
#"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.Start([Column1], 1) = Text.Lower(Text.Start([Column1], 1)) then "Lowercase" else "Uppercase"),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each if Text.End([Column1], 1) = "." then Text.End([Column1], 1) else if Text.End([Column1], 1) = Text.Lower(Text.End([Column1], 1)) then "Lowercase" else null)
in
#"Added Custom1"
Source Data:
SUBHEADING 1
Non-sensical
sentence 1. Sentence 2.
Sentence 3 part 3a,
part 3b, and
part 3c.
SUBHEADING 1.1.
Non-sensical
sentence 4. Sentence 5.
Sentence 6 part 3a,
part 3b, 3c and
3d.
2.0. SUBHEADING
Extra Info 1 (Not a proper sentence)
Extra Info 2
Sentence 7.
SUBHEADING 3.0
Sentence 8.
SOLUTION SO FAR (Minor error if Subheading starts with number)
let
Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
#"Added Custom" = Table.AddColumn(Source, "Start", each if Text.Start([Column1], 1) = Text.Lower(Text.Start([Column1], 1)) then "Lowercase" else "Uppercase"),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "End", each if Text.End([Column1], 1) = "." then Text.End([Column1], 1) else if Text.End([Column1], 1) = Text.Lower(Text.End([Column1], 1)) then "Lowercase" else null),
#"Added Custom4" = Table.AddColumn(#"Added Custom1", "Complete Sentences", each if [Start] = "Uppercase" and [End] = "." then "COMPLETE" else null),
#"Added Index" = Table.AddIndexColumn(#"Added Custom4", "Index", 0, 1, Int64.Type),
#"Added Custom2" = Table.AddColumn(#"Added Index", "StartCheckCaseBelow", each try #"Added Index" [Start] { [Index] + 1 } otherwise "COMPLETE"),
#"Removed Columns1" = Table.RemoveColumns(#"Added Custom2",{"Index"}),
#"Added Custom3" = Table.AddColumn(#"Removed Columns1", "CompleteHeadings_CompleteText", each if [StartCheckCaseBelow]= "COMPLETE" then "COMPLETE" else if [Start] = "Uppercase" and [StartCheckCaseBelow] = "Uppercase" then "COMPLETE" else null),
#"Added Custom5" = Table.AddColumn(#"Added Custom3", "Custom", each if [CompleteHeadings_CompleteText] = null then if [Start] = "Uppercase" and [End] = "Lowercase" then "START" else if [End] = "." then "END" else null else null),
#"Added Index1" = Table.AddIndexColumn(#"Added Custom5", "Index", 0, 1, Int64.Type),
#"Merged Columns" = Table.CombineColumns(Table.TransformColumnTypes(#"Added Index1", {{"Index", type text}}, "en-GB"),{"Index", "CompleteHeadings_CompleteText"},Combiner.CombineTextByDelimiter(" ", QuoteStyle.None),"CompleteHeadings_CompleteText"),
#"Added Index3" = Table.AddIndexColumn(#"Merged Columns", "Index", 0, 1, Int64.Type),
#"Filtered Rows" = Table.SelectRows(#"Added Index3", each ([Custom] = "START")),
#"Added Index2" = Table.AddIndexColumn(#"Filtered Rows", "temp", 0, 1, Int64.Type),
combined = #"Added Index2" & Table.SelectRows(#"Added Index3", each [Custom] <> "START"),
#"Sorted Rows" = Table.Sort(combined,{{"Index", Order.Ascending}}),
#"Removed Columns" = Table.RemoveColumns(#"Sorted Rows",{"Index"}),
#"Added Custom6" = Table.AddColumn(#"Removed Columns", "Custom.1", each if Text.Contains([CompleteHeadings_CompleteText], "COMPLETE") then [CompleteHeadings_CompleteText] else if [Complete Sentences] <> null then [Complete Sentences] else [temp]),
#"Filled Down" = Table.FillDown(#"Added Custom6",{"Custom.1"}),
#"Removed Other Columns" = Table.SelectColumns(#"Filled Down",{"Column1", "Custom.1"}),
#"Grouped Rows" = Table.Group(#"Removed Other Columns", {"Custom.1"}, {{"Text", each Text.Combine([Column1], " "), type text}}),
#"Removed Other Columns1" = Table.SelectColumns(#"Grouped Rows",{"Text"})
in
#"Removed Other Columns1"
This works to achieve the table in orange however requires many steps which need to be simplified. If you could advise on this or know of a better way please let me know.