0

I'm trying to concatenate text files. Almost everything works, but the output file has a space between each letter, like l o r e m i p s u m.

Here's my code:

import glob

all = open("all.txt","a");

for f in glob.glob("*.txt"):
    print f
    t = open(f, "r")
    all.write(t.read())
    t.close()

all.close()

I'm working on Windows 7 with Python 2.7.

EDIT
Maybe there's a better way to concatenate files?

EDIT2
Now I'm getting a decoding error (traceback first, then the code that produced it):

Traceback (most recent call last):
  File "P:\bwiki\BWiki\MobileNotes\export\999.py", line 9, in <module>
    all.write( t.read())
  File "C:\Python27\lib\codecs.py", line 671, in read
    return self.reader.read(size)
  File "C:\Python27\lib\codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 18: invalid continuation byte


import codecs
import glob

all =codecs.open("all.txt", "a", encoding="utf-8")

for f in glob.glob("*.txt"):
    print f
    t = codecs.open(f, "r", encoding="utf-8")
    all.write( t.read())
Chris
  • The best way to concatenate text files is with a simple batch command. You can simply add the files together as if they were numbers. – PlamZ Jan 15 '15 at 16:48
  • I suspect this bug may have something to do with the fact that you're opening `all.txt` twice: once when assigning it to `all`, and again when you open it in your loop, because `all.txt` matches the glob `"*.txt"`. – Alex Bliskovsky Jan 15 '15 at 16:51
  • @AlexBliskovsky I don't think that would produce the symptoms described, but you're right that that's a bug *also*. – zwol Jan 15 '15 at 16:52
  • @PlamZ I tried with `type` and the effect was the same: I got each letter separated by a space. – Chris Jan 15 '15 at 16:53
  • This smells like an encoding issue to me. Unfortunately I don't know of an encoding *detector* in the Python 2 stdlib. – zwol Jan 15 '15 at 16:53
  • @Chris Try `copy /?` – PlamZ Jan 15 '15 at 16:57
  • It is best to use [the `with` statement](https://www.youtube.com/watch?v=lRaKmobSXF4) when working with files in Python. – Gareth Latty Jan 15 '15 at 17:03

3 Answers

2

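Your input file is probably UTF-16-encoded, but you're reading it as plain bytes, as if it were ASCII. In UTF-16 every ASCII character is stored as two bytes, one of which is a null byte, and those null bytes are what show up as "spaces" between the letters. Try: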

import codecs

...

for f in glob.glob("*.txt"):
    print f
    t = codecs.open(f, "r", encoding="utf-16")
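
As a small illustration of where the "spaces" come from (a sketch only; the sample string and the variable names are made up and are not the asker's data), this encodes a short word as UTF-16 little-endian with a byte-order mark and counts the null bytes:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: the sample text is an assumption, not the asker's data.
text = u"lorem"

# UTF-16 little-endian, preceded by its byte-order mark (FF FE).
raw = b"\xff\xfe" + text.encode("utf-16-le")

# Each ASCII letter is stored as the letter byte followed by a NUL byte.
nul_count = raw.count(b"\x00")

# Decoding with the matching codec consumes the BOM and the NUL padding.
decoded = raw.decode("utf-16")

print(repr(raw))
print(nul_count)
print(repr(decoded))
```

Copying `raw` into the output file unchanged keeps those NUL bytes, which many viewers render as blanks; decoding first yields the plain text again.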
Daniel Robinson
1

The "space" between letters might indicate that at least some of the files use the UTF-16 encoding.

If all files use the same character encoding, you could use the cat(1) command, that is, copy the files as bytes (code example in Python 3). Here's the PowerShell cat command that corresponds to your Python code:

PS C:\> Get-Content *.txt | Add-Content all.txt

Unlike `cat *.txt >> all.txt`, it should not corrupt the character encoding.

Your code should work if you use binary file mode:

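from glob import glob
from shutil import copyfileobj

output_name = 'all.txt'
with open(output_name, 'ab') as output_file:
    for filename in glob("*.txt"):
        if filename == output_name:
            continue  # "*.txt" also matches the output file; don't copy it into itself
        with open(filename, 'rb') as file:
            copyfileobj(file, output_file)  # raw byte copy, no decoding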

Again, all files should have the same character encoding; otherwise you may get garbage (mixed content) in the output.
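
If the files turn out not to share one encoding, one possible approach is to decode each file separately and write everything out as UTF-8. The following is a minimal sketch under stated assumptions, not a definitive solution: `sniff_encoding` and `concatenate_as_utf8` are hypothetical helper names, the detection only looks for a byte-order mark, and the `cp1252` fallback for files without one is a guess that has to be adjusted to the actual data.

```python
import codecs
import io


def sniff_encoding(head, fallback="cp1252"):
    """Guess an encoding from a byte-order mark in the first bytes.

    `fallback` is an assumption for files without a BOM.
    """
    # UTF-32 must be checked first: its LE BOM starts with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return fallback


def concatenate_as_utf8(paths, output_path, fallback="cp1252"):
    """Decode every input file and write the combined text as UTF-8."""
    # newline="" keeps the original line endings untouched.
    with io.open(output_path, "w", encoding="utf-8", newline="") as out:
        for path in paths:
            if path == output_path:
                continue  # never feed the output file back into itself
            with open(path, "rb") as raw:
                head = raw.read(4)
            encoding = sniff_encoding(head, fallback)
            with io.open(path, "r", encoding=encoding, newline="") as src:
                out.write(src.read())
```

It could be called as `concatenate_as_utf8(glob("*.txt"), "all.txt")`. Note that this sketch overwrites the output file instead of appending to it, so that text in different encodings never ends up mixed in one file.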

jfs
0

Please run this program and edit the output into your question (we probably only need to see the first five lines of output, or so). It prints the first 16 bytes of each file in hexadecimal. This will help us figure out what is going on.

import glob
import sys

def hexdump(s):
    return " ".join("{:02x}".format(ord(c)) for c in s)

l = 0
for f in glob.glob("*.txt"):
    l = max(l, len(f))

for f in glob.glob("*.txt"):
    with open(f, "rb") as fp:
        sys.stdout.write("{0:<{1}}  {2}\n".format(f, l, hexdump(fp.read(16))))
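
As a rough guide to reading such a dump, here is a sketch with made-up sample bytes (none of them come from the asker's files). The `hexdump` variant below iterates over a `bytearray`, so it also runs on Python 3, where `ord()` on the items of a bytes object would fail.

```python
# -*- coding: utf-8 -*-
def hexdump(data):
    # bytearray yields integers on both Python 2 and Python 3.
    return " ".join("{:02x}".format(b) for b in bytearray(data))


# Hypothetical samples of the first bytes of three differently encoded files.
utf16_sample = b"\xff\xfe" + u"lorem".encode("utf-16-le")   # UTF-16 LE with BOM
utf8_bom_sample = b"\xef\xbb\xbf" + u"lorem".encode("utf-8")  # UTF-8 with BOM
single_byte_sample = b"dol\xf3r"                              # e.g. a cp125x file

print(hexdump(utf16_sample))        # ff fe 6c 00 6f 00 72 00 65 00 6d 00
print(hexdump(utf8_bom_sample))     # ef bb bf 6c 6f 72 65 6d
print(hexdump(single_byte_sample))  # 64 6f 6c f3 72
```

A leading `ff fe` or `fe ff` with alternating `00` bytes points to UTF-16, a leading `ef bb bf` to UTF-8 with a byte-order mark, and an isolated high byte such as `f3` between ASCII letters suggests a single-byte code page rather than UTF-8.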
zwol