1

I am trying to write a program in Python. I want to replace the whitespace in a .txt document with newlines. I have tried writing it myself, but my output file gets filled with weird characters. Can you help? :)

Freddy
  • 8
    you should post what you tried – Stephan Jul 26 '13 at 20:46
  • 1
    Disregard all answers with a for loop in them. `f.read()` reads the whole file. Looping over lines is too expensive if you don't explicitly need that. – erikbstack Jul 26 '13 at 21:04
  • Please remember to mark the answer that helped you the most as solution by clicking the green tick under the voting box of that answer. – erikbstack Jul 26 '13 at 21:19
  • actually it's not working well, only worked on some sentences and not on others. this was my code: import re with open('diviso.txt') as f, open('diviso2.txt', 'w') as out: for line in f: new_line = re.sub('\s', '\n', line) # print new_line out.write(new_line) – Freddy Jul 26 '13 at 21:34
  • why did it not work on all sentences :(? – Freddy Jul 26 '13 at 21:34
  • I also tried this code: with open("text.txt", 'r') as oFile: lResults = [line.replace(" ", "\n") for line in oFile] with open("results.txt", "w") as oFile: oFile.writelines(lResults) – Freddy Jul 26 '13 at 21:37

4 Answers

3

Here you go:

lResults = list()
with open("text.txt", 'r') as oFile:
    for line in oFile:
        sNewLine = line.replace(" ", "\n")
        lResults.append(sNewLine)

with open("results.txt", "w") as oFile:
    for line in lResults:
        oFile.write(line)

Here is an "optimized" version after the suggestions in the comments:

with open("text.txt", 'r') as oFile:
    lResults = [line.replace(" ", "\n") for line in oFile]

with open("results.txt", "w") as oFile:
    oFile.writelines(lResults)

EDIT: Response to comment:

hey sebastian - I just tried your code, it keeps giving me the weird characters in the output file! am i doing something wrong with it? – Freddy 1 min ago

What do you mean by "weird" characters? Do you have a non-ASCII file? Sorry, but for me it works perfectly fine, I just tested it.
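If the file does contain non-ASCII text, one thing worth checking is the encoding used to read and write it. A minimal sketch, assuming Python 3 and assuming the file is UTF-8 (the encoding and the function name `spaces_to_newlines` are guesses for illustration; a file exported from another program may well use a different encoding):

```python
def spaces_to_newlines(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, turning every space into a newline."""
    # Passing the encoding explicitly avoids relying on the platform
    # default, which differs between systems.
    with open(src_path, "r", encoding=encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(text.replace(" ", "\n"))
```

If the output still looks garbled, the assumed encoding is probably not the one the file was saved with.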

  • hey sebastian - I just tried your code, it keeps giving me the weird characters in the output file! am i doing something wrong with it? – Freddy Jul 26 '13 at 20:52
  • hey, I responded in the answer above –  Jul 26 '13 at 21:00
  • If you are on a *nix system (or using cygwin on Windows), I would write to stdout rather than a new file. This gives you the power to just print it out, redirect it `>` to a file of your choice, or pipe it `|` to another program. – mk12 Jul 26 '13 at 21:02
  • I am on windows, using txt document. I have tried again, this time the output file is empty :( – Freddy Jul 26 '13 at 21:04
  • hm, sorry, I don't know much about windows. I think I can't help you there. –  Jul 26 '13 at 21:05
  • Kudos for providing such a roundabout answer. But sadly the code is not optimal. However, with your approach you will be one of the best Python coders fast! Keep up the good work! – erikbstack Jul 26 '13 at 21:13
  • 1
    Good answer, but I'd prefer a list comprehension here. `lResults = [line.replace(" ", "\n") for line in oFile]`, secondly when writing the content of that list(`lResults`) to that output file, simply use: `oFile.writelines(lResults)`. No need of a loop at all. And one more thing better use `str.translate`, it is 5 times faster than `str.replace`. – Ashwini Chaudhary Jul 26 '13 at 21:28
  • ah cool, sounds reasonable. Will apply it next time I have to write sth. similar like this! –  Jul 26 '13 at 21:29
  • I just tried the optimized code on my sentences - it works on some but not on all :( – Freddy Jul 26 '13 at 21:37
  • can you post an example, that would be helpful to account for those special cases –  Jul 26 '13 at 21:50
  • I went to a shop that sold organic food and bought everything. I do not think the upswing will start before spring next year. The registration of students and all grants and loans have been suspended. Mr Howell is convinced that the tea market – Freddy Jul 26 '13 at 22:05
  • this is an example, some of these were divided (from Mr Howell onwards), the others weren't :( – Freddy Jul 26 '13 at 22:05
  • my file has around 600 of these sentences – Freddy Jul 26 '13 at 22:06
  • ok, i think i know why it didn't work - i pasted the text from an excel document. now, extracting the text from the tables and THEN pasting it into the txt document worked like a wonder! sorry for that! – Freddy Jul 26 '13 at 22:26
  • Hehe, I am really confused now, but if it worked, great ;) –  Jul 26 '13 at 22:54
2

Try this:

import re
s = 'the text to be processed'
re.sub(r'\s+', '\n', s)
=> 'the\ntext\nto\nbe\nprocessed'

Now, the "text to be processed" above will come from the input text file, which you first read into a string - see this answer for details on how to do this.
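Putting the two steps together might look like the sketch below. The helper names `whitespace_to_newlines` and `convert_file` and the `collapse` flag are made up for illustration: `collapse=True` uses `\s+` as above, so a run of whitespace becomes a single newline, while `collapse=False` uses `\s`, so each whitespace character gets its own newline.

```python
import re

def whitespace_to_newlines(text, collapse=True):
    """Replace whitespace in text with newlines."""
    pattern = r"\s+" if collapse else r"\s"
    return re.sub(pattern, "\n", text)

def convert_file(src_path, dst_path):
    # Read the whole input into one string, then write the converted text.
    with open(src_path) as src:
        text = src.read()
    with open(dst_path, "w") as dst:
        dst.write(whitespace_to_newlines(text))
```

The whole file is held in memory at once, which should be fine for a few hundred sentences.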

Óscar López
  • (And if you want *every* whitespace character swapped for a `\n`, get rid of the `+`. (@Oscar's solution will convert multiple spaces in a row to only one `\n`)) – cwallenpoole Jul 26 '13 at 20:47
  • thanks! Will this work on a file of 600+ sentences? – Freddy Jul 26 '13 at 20:48
1

You can achieve this with regular expressions:

import re

with open('thefile.txt') as f, open('out.txt', 'w') as out:
    for line in f:
        new_line = re.sub(r'\s', '\n', line)
        # print new_line
        out.write(new_line)

You probably need to write back new_line to a file instead of printing it :) (==> snippet edited).


See the Python regex documentation:

sub(pattern, repl, string, count=0, flags=0)
  • pattern: the search pattern
  • repl: the replacement string
  • string: the string to be processed, in this case, line

Note: if you only want to substitute whitespace that occurs at the end of the line, use the \s$ search pattern, where $ stands for the end of the string (so it reads "a whitespace character at the end of the string"). If you really just need to replace every space, then the replace method of str is probably enough.
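To make the difference between the two approaches concrete, here is a small sketch (the sample string is invented): `\s` matches tabs and other whitespace as well as spaces, whereas `str.replace(" ", "\n")` only touches literal spaces.

```python
import re

sample = "one two\tthree"

# \s matches any whitespace character, so the tab is replaced too.
with_regex = re.sub(r"\s", "\n", sample)

# str.replace only replaces the literal space; the tab stays.
with_replace = sample.replace(" ", "\n")

print(repr(with_regex))    # 'one\ntwo\nthree'
print(repr(with_replace))  # 'one\ntwo\tthree'
```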

Vincenzo Pii
1
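import os

def spaces_to_newlines(in_file, out_file):
    # newline='' (Python 3) switches off text-mode newline translation
    with open(in_file, 'r', newline='') as i, open(out_file, 'w', newline='') as o:
        o.write(i.read().replace(' ', os.linesep))

Notice that this neither loops nor writes '\n' but instead os.linesep, which is \n on Linux and \r\n on Windows. Both files are opened with newline='' because text mode would otherwise translate '\n' to the platform line ending by itself, turning the \r\n from os.linesep into \r\r\n on Windows.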
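What actually lands on disk can be checked directly. The sketch below (Python 3; the helper `written_bytes` is made up for illustration) uses the `newline` argument of `open` to show how text mode treats `\n` and `\r\n` on write, independent of the platform it runs on:

```python
import os
import tempfile

def written_bytes(text, newline):
    """Write text in text mode with the given newline setting; return the raw bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", newline=newline) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)

# newline="\r\n" mimics the translation Windows applies by default.
print(written_bytes("a\nb", "\r\n"))    # b'a\r\nb'
print(written_bytes("a\r\nb", "\r\n"))  # b'a\r\r\nb'
# newline="" writes the text exactly as given.
print(written_bytes("a\r\nb", ""))      # b'a\r\nb'
```

So a plain `'\n'` with the default text mode and `os.linesep` with `newline=''` should both end up with the platform's line ending; mixing `os.linesep` with the default mode is the combination to avoid.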

Also notice that the biggest part of the answer comes from alwaysprep, and he should get the credit for it if he takes the loop out of his solution. (Did he actually delete his answer? I can't find it anymore.)

erikbstack
  • I don't think there's anything wrong with using a loop, using `file.read()` puts the whole file content into memory, which is not memory efficient at all. – Ashwini Chaudhary Jul 26 '13 at 21:20
  • Only seldom will you hit memory limitations with text files. And without the loop you gain faster iteration (in the underlying interpreter, not in your code) and easier maintenance because there is less code. "Simple is better than complex." – erikbstack Jul 26 '13 at 23:39