This is the code I am using to replace special characters in text files and concatenate them into a single file.

    # -*- coding: utf-8 -*-

    import os
    import codecs

    dirpath = "C:\\Users\\user\\path\\to\\textfiles"
    filenames = os.listdir(dirpath)

    with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
        for fname in filenames:
            currentfile = dirpath+"\\"+fname
            with codecs.open(currentfile, encoding='utf8') as infile:
                #print currentfile
                outfile.write(fname)
                outfile.write('\n')
                outfile.write('\n')

                for line in infile:

                    line = line.replace(u"´ı", "i")
                    line = line.replace(u"ï¬", "fi")
                    line = line.replace(u"ﬂ", "fl")
                    outfile.write(line)

The first `line.replace` works fine while the others do not (which makes sense). Since no errors were generated, I thought there might be a problem of "visibility" (if that's the term), so I wrote this:

import codecs

currentfile = 'textfile.txt'
with codecs.open('C:\\Users\\user\\path\\to\\output2.txt', 'w', encoding='utf-8') as outfile:
    with open(currentfile) as infile:
        for line in infile:
            if "ï¬" not in line: print "not found!"

which always returns "not found!" proving that those characters aren't read.

When I change the first script to use `with codecs.open('C:\Users\user\path\to\output.txt', 'w', encoding='utf-8') as outfile:`, I get this error:

Traceback (most recent call last):
  File C:\\path\\to\\concat.py, line 30, in <module>
    outfile.write(line)
  File C:\\Python27\\codecs.py, line 691, in write
    return self.writer.write(data)
  File C:\\Python27\\codecs.py, line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)

Since I am not very experienced in Python, I can't figure it out from the sources already available: the Python documentation (1, 2) and relevant questions on StackOverflow (1, 2).

I am stuck here. Any suggestions? All answers are welcome!

Community
  • I'd suggest printing the `repr` of the lines and of the characters you are trying to replace. They probably look the same but are different characters internally. – Bakuriu Jun 11 '13 at 11:03

1 Answer

There is no point in using codecs.open() if you don't use an encoding. Either use codecs.open() with an encoding specified for both reading and writing, or forgo it completely. Without an encoding, codecs.open() is an alias for just open().
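To see the difference an explicit encoding makes, here is a minimal sketch. It assumes a throwaway temporary file and a hypothetical sample line (neither comes from the question) and reads the same bytes once in binary mode and once through `codecs.open()` with `encoding='utf8'`. It is written to run on both Python 2.7 and Python 3.

```python
import codecs
import os
import tempfile

# Hypothetical sample data, not taken from the question's files.
sample = u"Agroqu\u00edmica\n"

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Store the sample on disk as UTF-8 bytes.
    with open(path, "wb") as f:
        f.write(sample.encode("utf8"))

    # Binary mode: the line comes back as undecoded bytes.
    with open(path, "rb") as f:
        raw_line = f.readline()

    # With an encoding: the line comes back decoded to a Unicode value.
    with codecs.open(path, encoding="utf8") as f:
        text_line = f.readline()
finally:
    os.remove(path)

print(repr(raw_line))   # byte string
print(repr(text_line))  # Unicode string
```

Only the second form gives `line.replace()` Unicode text to work on.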

Here you really do want to specify the codec of the file you are opening, to process Unicode values. You should also use unicode literal values when straying beyond ASCII characters; specify a source file encoding or use unicode escape codes for your data:

# -*- coding: utf-8 -*- 
import os
import codecs

dirpath = u"C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)

with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = os.path.join(dirpath, fname)
        with codecs.open(currentfile, encoding='utf8') as infile:
            outfile.write(fname + '\n\n')
            for line in infile:
                line = line.replace(u"´ı", u"i")
                line = line.replace(u"ï¬", u"fi")
                line = line.replace(u"ﬂ", u"fl")
                outfile.write(line)

This tells the interpreter that your source file is saved as UTF-8, ensuring that the u"´ı" code points are correctly decoded to Unicode values. Passing an encoding to codecs.open() makes sure that the lines you read are decoded to Unicode values and that your Unicode values are written to the output file as UTF-8.
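The alternative mentioned above, Unicode escape codes, can be sketched as follows. This keeps the source file pure ASCII, so the result no longer depends on how the editor saved it. The `REPLACEMENTS` table and the `replace_special` helper are hypothetical names, and the two ligature code points are an assumption about which characters the question intended to replace.

```python
# Each pair maps a special character sequence to its plain replacement.
REPLACEMENTS = [
    (u"\u00b4\u0131", u"i"),  # ACUTE ACCENT + LATIN SMALL LETTER DOTLESS I
    (u"\ufb01", u"fi"),       # LATIN SMALL LIGATURE FI (assumed target)
    (u"\ufb02", u"fl"),       # LATIN SMALL LIGATURE FL (assumed target)
]


def replace_special(line):
    """Apply every replacement to a Unicode line and return the result."""
    for old, new in REPLACEMENTS:
        line = line.replace(old, new)
    return line


print(replace_special(u"\ufb01nd \ufb02ow"))  # find flow
```

The helper expects lines that are already decoded, for example those read through `codecs.open(..., encoding='utf8')`.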

Note that the dirpath value is a Unicode value as well. If you use a Unicode path, then os.listdir() returns Unicode filenames, which is essential if you have any non-ASCII characters in those filenames.
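A small sketch of that point, assuming a temporary directory and a hypothetical file name containing a non-ASCII character (both invented for illustration). The listing is done with a Unicode path, so the names can be written straight to a file opened with an encoding. It should run on Python 2.7 and Python 3.

```python
import codecs
import os
import shutil
import sys
import tempfile

text_type = type(u"")  # unicode on Python 2, str on Python 3

tmpdir = tempfile.mkdtemp()
if isinstance(tmpdir, bytes):
    # Python 2 returns a byte-string path here; decode it to get a Unicode path.
    tmpdir = tmpdir.decode(sys.getfilesystemencoding())

try:
    # Hypothetical input file whose name contains a non-ASCII character.
    codecs.open(os.path.join(tmpdir, u"Agroqu\u0131mica.txt"), "w", encoding="utf8").close()

    # Unicode path in, Unicode file names out.
    names = os.listdir(tmpdir)

    out_path = os.path.join(tmpdir, u"output.txt")
    with codecs.open(out_path, "w", encoding="utf8") as outfile:
        for fname in names:
            # Unicode + Unicode: nothing has to be decoded implicitly.
            outfile.write(fname + u"\n\n")

    with codecs.open(out_path, encoding="utf8") as check:
        written = check.read()
finally:
    shutil.rmtree(tmpdir)

print(all(isinstance(name, text_type) for name in names))
```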

If you do not do all this, chances are your source code encoding does not match the data you read from the file, and you are trying to replace the wrong set of encoded bytes with a few ASCII characters.
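One plausible way such a mismatch arises (an assumption, not something confirmed in this thread) is that a UTF-8 encoded ligature gets displayed or pasted as Latin-1. The sketch below shows that the resulting text starts with "ï¬" and that searching a correctly decoded line for it finds nothing, which would make `line.replace()` a silent no-op. The sample `line` is hypothetical.

```python
ligature = u"\ufb01"                     # LATIN SMALL LIGATURE FI
utf8_bytes = ligature.encode("utf8")     # three bytes: EF AC 81
misread = utf8_bytes.decode("latin-1")   # each byte becomes its own character

# A hypothetical line as it would look after correct UTF-8 decoding.
line = u"de\ufb01ne"

found_misread = misread in line    # the mis-decoded form is not in the text
found_ligature = ligature in line  # the real code point is

print(repr(utf8_bytes))
print(found_misread, found_ligature)
```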

Martijn Pieters
  • I didn't know about the `u` unicode escape code, so I added it to my script. Unfortunately nothing changed; I am still getting the same error. –  Jun 11 '13 at 13:09
  • What error? Same line? Is it a UnicodeEncode exception this time perhaps? – Martijn Pieters Jun 11 '13 at 13:10
  • You want to use unicode exclusively in your code. That means you need to a) tell Python what codec your source file uses, b) what codec the files you are reading use and c) what codec the file you are writing to should use. You should not go only part of the way. – Martijn Pieters Jun 11 '13 at 13:13
  • Traceback (most recent call last): File C:\\path\\to\\concat.py, line 15, in <module> outfile.write(fname + '\n\n') File C:\\Python27\\codecs.py, line 691, in write return self.writer.write(data) File C:\\Python27\\codecs.py, line 351, in write data, consumed = self.encode(object, self.errors) UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128) –  Jun 11 '13 at 13:17
  • @LDN-5602: That means you are trying to write byte strings (*not* Unicode values) and in order to encode those to UTF-8, they need to be decoded to Unicode first. Using the default codec, ASCII, fails for those. Are you certain you are opening the files-to-read with an encoding too? In other words, are you applying **all** the changes I suggested? – Martijn Pieters Jun 11 '13 at 13:19
  • I forgot to edit `# -*- coding: utf-8 -*-`, but it still looks like the same error to me, with the exception of the first line. Yes, I am positive; I just checked it again. –  Jun 11 '13 at 13:20
  • It is not the codec that is the problem. Use `print repr(line)` to show you what type of data you have. If there is no `u` at the start you do *not* have unicode. – Martijn Pieters Jun 11 '13 at 13:22
  • @Bakuriu @Martijn Pieters `print repr(line)` gave back this, for example: Agroqu\xc2\xc4\xb1mica, instead of Agroquimica. Does this confirm what you are saying, and if so, what then? –  Jun 11 '13 at 13:27
  • Are there no quotes around that? Is it `u'Agroqu\xc2\xc4\xb1mica'` or `'Agroqu\xc2\xc4\xb1mica'`? – Martijn Pieters Jun 11 '13 at 13:37
  • Do you know what encoding was used for that file? Those bytes are *not* UTF8. – Martijn Pieters Jun 11 '13 at 13:38
  • Maybe it is latin-1 or something similar, which is not supported by Python if I am not mistaken. This phrase is enclosed in ' ' –  Jun 11 '13 at 13:44
  • By the way, your answer is accepted for improving my code and educating me. Thanks for the effort. –  Jun 11 '13 at 14:02
  • @LDN-5602: It is *not* Latin 1 (which is supported by Python just fine); not sure what encoding that is though. You are trying to write byte strings, not Unicode strings. This means `infile` is not being decoded, which in turn means you didn't specify an `encoding` parameter. – Martijn Pieters Jun 11 '13 at 14:04
  • Then maybe `currentfile = os.path.join(dirpath, fname)` prevents `with codecs.open(currentfile, encoding='utf8') as infile:` from being identified correctly. –  Jun 11 '13 at 14:10
  • @LDN-5602: No, that won't be the case. I'm wondering if I missed something here; what does `print type(line)` give you when you add that test *before* as well as after the `line.replace()` lines; that'd let you verify that you have unicode and you *still* have unicode by the time you write out to disk. – Martijn Pieters Jun 11 '13 at 14:14
  • `for line in infile: print type(line)` gave `<type 'unicode'>` and `outfile.write(line) print type(line)` also gave `<type 'unicode'>`, so I guess unicode it is. –  Jun 11 '13 at 14:22
  • @LDN-5602: Hey, and you **still** get decode errors? `print repr(line)` will still show you `''` byte string data? You have unicode objects there, so the `codecs.open(..., encoding='..')` is working. – Martijn Pieters Jun 11 '13 at 14:24
  • We are missing something (or I've screwed up somewhere, which is more than probable). –  Jun 11 '13 at 14:27
  • For example, why does it generate an error in line 15 of the module, `outfile.write(fname)`, after the changes? Maybe this is pertinent? –  Jun 11 '13 at 14:38
  • @LDN-5602: No idea; perhaps you made a syntax error somewhere? What is the error? Are there enough closing parentheses to match all opening parentheses on previous lines? – Martijn Pieters Jun 11 '13 at 14:42
  • It is not a syntax error. The message `File C:\\path\\to\\concat.py, line 15, in <module> outfile.write(fname + '\n\n')` appears, and the generated text file has no empty lines. –  Jun 11 '13 at 14:50
  • Not sure what you mean by that; you didn't tell me the error message you have. – Martijn Pieters Jun 11 '13 at 15:00
  • `Traceback (most recent call last): File C:\\path\\to\\concat.py, line 15, in <module> outfile.write(fname + '\n\n')`; those are the first two lines of the cmd message. It might not look like an error, but I still get the text in the file without the empty lines, which means that `outfile.write(fname + '\n\n')` is not applied. Sorry for not being clear before. –  Jun 11 '13 at 15:08
  • @LDN-5602: Aha! Your **filename** contains non-ASCII characters too. Updated to let `os.listdir()` return unicode too. – Martijn Pieters Jun 11 '13 at 15:14
  • Yes! That solved the problem, and I might not even have to replace anything now. You really saved me some time and I learned some new things. Thanks a lot! –  Jun 12 '13 at 11:58
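The error from this thread can be reproduced in isolation. As explained in the comments, Python 2 first decodes a byte string with the default ASCII codec before it can encode it to UTF-8. The sketch below performs that decode step explicitly, using a hypothetical byte-string file name that starts with the byte 0xe7 (the actual file name is not given in the thread).

```python
# Hypothetical byte-string file name, as os.listdir() returns on Python 2
# when it is given a byte-string path.
raw_name = b"\xe7a.txt"

try:
    # The step Python 2 performs implicitly when a byte string is written
    # to a file opened with codecs.open(..., encoding='utf8').
    raw_name.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
else:
    message = None

print(message)
# 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
```

Listing the directory with a Unicode path avoids this step entirely, because the file names are already Unicode values when they reach `outfile.write()`.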