simple example: func-tional --> functional

The story is that I got a Microsoft Word document that was converted from PDF, and some words remain hyphenated (such as func-tional, which was split at a line break in the PDF). I want to rejoin those broken words while leaving normal hyphenated words (i.e., where the "-" is not a line-break artifact) untouched.

To make this clearer, here is a longer example (source text):

After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance.

Could someone give me some suggestions on this problem?

Ian
  • `s.replace("-\n", "")`? – iBug Sep 03 '18 at 08:12
  • If the `\n` is kept, use the line from the comment above; otherwise check out this post: https://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python – Chris Sep 03 '18 at 08:13
  • How are you going to tell line-split hyphens from necessary hyphens? – khelwood Sep 03 '18 at 08:14
  • @khelwood: As the name suggests, line-split hyphens are located just before a line break. On Windows that might be `\r\n`, so if you find the combination `-\r\n`, it's a hyphen to be replaced. – meissner_ Sep 03 '18 at 08:25
  • @meissner_ What I mean is, how do you tell if a hyphen at the end of the line was added when the line was split, or if it was already a hyphenated word, with a line break falling at the position where the hyphen was already present? – khelwood Sep 03 '18 at 08:40
  • @khelwood Simple: I don't. And to this day it hasn't bitten me in the 4$$. Probably because this case is exceedingly rare as long as people are responsible for hyphenating their own text? – meissner_ Sep 03 '18 at 09:01
  • Could you provide an example with both cases: a hyphen that should be removed and one that should stay at the end of a line? Your example is only one long string at the moment. – Sharku Sep 03 '18 at 09:04
  • @meissner_ Not sure why you'd think it was exceedingly rare. Any decent-sized text containing hyphenated words will occasionally line break on them. – khelwood Sep 03 '18 at 09:09
  • @khelwood, I don't think it is exceedingly rare either. So it is worth figuring out a solid solution. – Ian Sep 03 '18 at 11:36
  • @Sharku, each of my source texts is one long (or short) string. There is only one end of line in each source text, so "-\n" (or "-\r\n") would not exist in this case. – Ian Sep 03 '18 at 11:44
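
For the case discussed in the comments above, where the line breaks are still present in the text, the `-\n` / `-\r\n` replacement can be sketched as follows. This is only a minimal illustration; the function name `join_linebreak_hyphens` is made up for this example, and it cannot tell a line-split hyphen from a genuine hyphen that happens to fall at the end of a line (the concern khelwood raises).

```python
import re


def join_linebreak_hyphens(text):
    # Remove a hyphen that is immediately followed by a line break.
    # "\r?" makes the pattern match both Unix (\n) and Windows (\r\n) endings.
    return re.sub(r"-\r?\n", "", text)


sample = "created the Func-\r\ntional Check Flight Compendium"
print(join_linebreak_hyphens(sample))
# prints: created the Functional Check Flight Compendium
```

Hyphens that are not followed by a line break, such as those in "mother-in-law", are left untouched.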

1 Answer

I would use a regular expression. This little script searches for hyphenated words and removes the hyphens from them.

import re


def replaceHyphenated(s):
    # Remove every hyphen that sits directly between two word characters.
    # Caveat: this also strips the hyphens from genuine compounds
    # such as "mother-in-law".
    return re.sub(r"(?<=\w)-(?=\w)", "", s)



if __name__ == "__main__":

    s = """After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance."""    
    print(replaceHyphenated(s))

The output would be:

After the symposium, the Foundation and the FCF steering team continued their work and created the Functional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendium, contact your manufacturer for further guidance.

If you are not used to regular expressions, I recommend this site: https://regex101.com/

Sharku
  • Thanks! But how do I tell line-split hyphens from necessary hyphens (as @khelwood asked)? – Ian Sep 03 '18 at 08:57
  • Do you mean at the end of the line, or do you mean words that need a hyphen, like mother-in-law? If there is a line-split sequence like func -\n tional, it will not match the RegExp. – Sharku Sep 03 '18 at 09:00
  • Yes, I mean we need to keep the "-" in "mother-in-law" while removing the one in "Func-tional". And there are no "\n" characters left in my source text (i.e., a paragraph). – Ian Sep 03 '18 at 11:26
  • Phew, I am sorry, but I don't think that is possible. How should the computer know whether the hyphen belongs to a word like mother-in-law or is an artifact like in func-tional? Perhaps you can change the read-in algorithm to keep the newline characters or something. I mean, that is what they are for: to tell the computer when a new line begins. – Sharku Sep 03 '18 at 11:36
  • Perhaps you're right; I expected it to be difficult, if possible at all. My source texts come from PDF --> Word ([PDF to DOCX](https://pdftotext.com/)) --> .txt file. This saves me from having to parse the PDF lines into paragraphs correctly myself. But I have to remove the "-" in the .txt file, which is what prompted this question. – Ian Sep 03 '18 at 12:02
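
One possible heuristic for the mother-in-law vs. Func-tional problem discussed above is to join a hyphenated token only when its hyphen-free form is a known word. The sketch below is not part of the original answer. As an assumption, it builds the vocabulary from the words that already appear without a hyphen elsewhere in the same text (in the question's sample, "functional" and "compendium" both do). The names `join_known_words` and `extra_vocab` are hypothetical.

```python
import re

# A token made of letters joined by one or more hyphens, e.g. "Func-tional".
HYPHENATED = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)+")
WORD = re.compile(r"[A-Za-z]+")


def join_known_words(text, extra_vocab=()):
    # Vocabulary: words that occur without a hyphen elsewhere in the text,
    # plus any extra words supplied by the caller (e.g. from a word list).
    plain_text = HYPHENATED.sub(" ", text)
    vocab = {w.lower() for w in WORD.findall(plain_text)}
    vocab.update(w.lower() for w in extra_vocab)

    def fix(match):
        token = match.group(0)
        joined = token.replace("-", "")
        # Join only if the hyphen-free form is a known word; else keep as is.
        return joined if joined.lower() in vocab else token

    return HYPHENATED.sub(fix, text)


text = ("created the Func-tional Check Flight Compendium. This compendium "
        "helps with functional check flights. Ask your mother-in-law about "
        "the compendi-um.")
print(join_known_words(text))
```

With this input, "Func-tional" and "compendi-um" are joined while "mother-in-law" keeps its hyphens. The heuristic is not watertight: a broken word that never appears unhyphenated stays broken unless it is passed in via `extra_vocab`, and a genuine compound whose joined form is also a word (such as "re-sign" when "resign" is in the vocabulary) would be joined by mistake.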