Eliminate syllabic separation in txt files

Question

First, since this is my first post, I did a lot of research before publishing my question, as suggested in the Q&A of this excellent platform. Second, I am not a Python expert; I'm just an enthusiast of this great programming language. I'm trying to fix a relatively large txt file. The point of the question is correcting the syllabic separations (words split by a hyphen) that appear in that file in great quantity. In my research I found some posts similar to my question, which helped me think about it from different perspectives. Among them I can highlight:

How to delete empty lines

Using grep:

grep -v '^$' test1.txt > test2.txt

How to delete all blank lines

Using fileinput:

import fileinput

# Rewrite "file" in place, keeping only the non-blank lines
for line in fileinput.FileInput("file", inplace=1):
    if line.rstrip():
        print(line, end="")

But, unfortunately my question is more specific. In my txt files there are many syllabic separations. Thank goodness, they are all standardized this way:

example_of_w'- '

ord

My goal is for a Python script to correct / eliminate all of these separations in the file, as in the example below:

Before script:

example_of_w'- '

ord

After script:

example_of_word

Note that the syllabic separation is patterned with - and 'space'. Please excuse me if I was not able to be clear in my question, and with my language mistakes. I thank you all for any help. An excellent day for everyone!

Can you **provide a line example** where the pattern is present? Is this correct: `example_of_w- ord` must output `example_of_word`.? Or should it have the single quotes? — DarK_FirefoX, Apr 07 '20 at 13:44
Can you please upload your text file and give a link here, or just paste some of the text here? The example you provided doesn't make clear what the structure of your input file is, so we can't write a script for you! — Zain Arshad, Apr 07 '20 at 13:51
with open('path/to/file') as infile, open('output.txt', 'w') as outfile: for line in infile: if not line('- '): continue # correct/eliminate the ('- ') word outfile.write(line) # non-('- ')word. Write it to output — Evandro Mourão, Apr 07 '20 at 14:57
That's exactly it, @DarK_FirefoX ! That's the objective. output: example_of_word — Evandro Mourão, Apr 07 '20 at 15:45
Can you also provide an example of a word separated by a space? If it's like the example you gave, two chunks on separate lines but without '-', I doubt there is much you can do, because at that point, to distinguish a space separating two chunks of one word from a space separating two different words, you would need a dictionary to check whether two connected subsequent chunks form a valid word (and you don't have such a dictionary, because you are trying to build it) — Edoardo Guerriero, Apr 07 '20 at 19:23
Hi @EdoardoGuerriero. Sorry for my bad explanation. I don't know Python's syntax well enough to describe the issue. But, for example, we have a phrase in a txt file like this one: ***refuge or dis- charge area to exit the event grounds.*** I would like to create a script that takes all the lines in the file and corrects these syllable separations at once. Most of the data I work on comes with these separations. — Evandro Mourão, Apr 07 '20 at 23:28
It doesn't matter if after the correction, the sentence stays on the same line or not. The goal here would just be to say, "rewire" the words. The only thing we will have to worry about is putting an "exception", in words that are already spelled correctly, but that have the hyphens ('-'). But I think that would not be difficult, because it would be glued. Thank you for your effort in trying to help. :) — Evandro Mourão, Apr 07 '20 at 23:29
Thanks for your answers @Dark_FirefoX, the problem is, how to do that for all lines of a large txt file? — Evandro Mourão, Apr 08 '20 at 11:23
@ECMJ, I edited my answer to do the same thing over a large `txt` file. — DarK_FirefoX, Apr 08 '20 at 13:14

DarK_FirefoX · Accepted Answer · 2020-04-08T14:50:51.520

I don't know the complete scope of your problem, but with the little information you have provided so far, you could do:

a = "example_of_w- ord has to be interest- ing"

# replace all occurrences of the first argument with the second argument
print(a.replace("- ", ""))

Outputs:

example_of_word has to be interesting

EDIT:

If you want to do this to all lines on a txt file you can do this:

This is the content of sy.txt:

example_of_w- ord has to be interest- ing. refuge or dis- charge area to exit the event grounds. refuge or dis- charge area to exit the event grounds. example_of_w- ord has to be interest- ing. wa- ter in the riv- er is wonder- ful

This is the script, in the same folder as sy.txt:

output = ""
replaceParameter = "- "
with open("sy.txt") as f:
    for line in f:
        output += line.replace(replaceParameter, "")

print(output)

And the output would be:

example_of_word has to be interesting. refuge or discharge area to exit the event grounds. refuge or discharge area to exit the event grounds. example_of_word has to be interesting. water in the river is wonderful

As you can see, I open the file, loop through all of its lines, and replace each occurrence of replaceParameter = "- " with an empty string.
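For a large file, the same idea can be sketched so that each corrected line goes straight to a second file instead of being accumulated in one big string. This is only a minimal sketch: the function name `remove_split_hyphens` and the output filename `sy_fixed.txt` are assumptions made for illustration and are not part of the script above.

```python
def remove_split_hyphens(src_path, dst_path, marker="- "):
    """Copy src_path to dst_path, deleting every occurrence of marker.

    Hypothetical helper: it only handles markers that sit inside a line,
    not separations that span a line break.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # str.replace removes all occurrences of the marker in this line
            dst.write(line.replace(marker, ""))


# Example call ("sy_fixed.txt" is an assumed output name):
# remove_split_hyphens("sy.txt", "sy_fixed.txt")
```

Because the file is processed one line at a time, memory use stays small regardless of the file size, and the original file is left untouched.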

EDIT 2:

This also works when the separation falls at the end of a line:

output = ""
replaceParameter = "- "

with open("sy.txt") as f:
    for line in f:
        output += line

output = output.replace("\n- ", "")
output = output.replace("-\n ", "")
output = output.replace("- \n", "")
output = output.replace(replaceParameter , "")

print(output)

Trying it out for this input:

example_of_w- ord has to be interest- ing. refuge or dis- charge area to exit the event grounds.
refuge or dis- charge area to exit the event grounds. example_of_w- ord has to be interest- ing.
wa- ter in the riv- er is wonder- ful. refuge or dis- charge area to exit the event grounds ref- 
uge or dis- charge area to exit the event grounds. refuge or dis- charge area to exit the linebr
- eaks

And the output:

example_of_word has to be interesting. refuge or discharge area to exit the event grounds.
refuge or discharge area to exit the event grounds. example_of_word has to be interesting.
water in the river is wonderful. refuge or discharge area to exit the event grounds refuge or discharge area to exit the event grounds. refuge or discharge area to exit the linebreaks
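As an alternative to the chain of `replace` calls, the same cases could be covered by one regular expression. Note that this swaps in a different technique (`re.sub`) from the one used above, and the names `SPLIT_HYPHEN` and `join_split_words` are hypothetical. The sketch assumes a separation always looks like a word character, an optional line break, a hyphen, and then at least one space or line break before the rest of the word. Under that assumption, words that are legitimately hyphenated with no space (such as `well-known`) and a free-standing ` - ` are left alone.

```python
import re

# Hypothetical pattern: preceded by a word character, an optional newline,
# a hyphen, one or more whitespace characters (space or newline),
# and followed by a word character.
SPLIT_HYPHEN = re.compile(r"(?<=\w)\n?-\s+(?=\w)")


def join_split_words(text):
    """Rejoin words split as 'dis- charge', including across line breaks."""
    return SPLIT_HYPHEN.sub("", text)


sample = "refuge or dis- charge area ref- \nuge, linebr\n- eaks, a well-known exit"
print(join_split_words(sample))
# refuge or discharge area refuge, linebreaks, a well-known exit
```

One caveat of this sketch: `\s+` also matches blank lines, so a hyphen at the very end of a paragraph would be joined with the first word of the next one.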

Note that this will not work if the chunks to combine are in separate lines — Edoardo Guerriero, Apr 08 '20 at 13:37
@EdoardoGuerriero, you are completely right, but I fail to see where he stated that. Just being reactive. So @ECMJ, could the `- ` be at the end of a line? Could the `-` be at the end and the space ` ` on the next line? Pretty sure I know the answers to these questions, but I would want them clarified. — DarK_FirefoX, Apr 08 '20 at 13:49
@ECMJ, edited with a solution that handles the linebreaks. Hope this helps. — DarK_FirefoX, Apr 08 '20 at 14:51
