I am trying to perform some replacements in a file:

'\t' --> '◊'
'⁞' --> '\t'

This question recommends the following procedure:

import fileinput

with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
    for line in file:
        line = line.replace('\t','◊')
        print(line.replace('⁞','\t'), end='')

I am not allowed to comment there, but when I run this piece of code I get an error saying:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 10: character maps to <undefined>

I have previously fixed this kind of error by adding encoding='utf-8'. The problem is that fileinput.FileInput() does not accept an encoding argument.

Question: How can I get rid of this error?


I would prefer the solution above, provided it can be made to work and its speed is comparable to the following method, since it appears to do in-place replacement the way it should be done.

I have also tried:

replacements = {'\t': '◊', '⁞': '\t'}
with open(filename, encoding='utf-8') as inFile:
    contents = inFile.read()
# Replacements run in dict order: tabs become '◊' before '⁞' becomes a tab.
for old, new in replacements.items():
    contents = contents.replace(old, new)
with open(filename, mode='w', encoding='utf-8') as outFile:
    outFile.write(contents)

This is relatively fast, but it reads the whole file into memory at once, so it is very greedy when it comes to memory.


For UNIX users: I need something that does the equivalent of the following:

sed -i 's/\t/◊/g' 'file.csv'
sed -i 's/⁞/\t/g' 'file.csv'

This turns out to be rather slow.

Sandu Ursu

1 Answer

In general, you can tell FileInput which encoding to use by passing fileinput.hook_encoded(...) as the openhook parameter:

import fileinput

with fileinput.FileInput(filename, openhook=fileinput.hook_encoded('utf-8')) as file:
    # ...
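
To make the openhook variant concrete, here is a minimal read-only sketch. The helper name read_lines_utf8 is hypothetical (it is not part of fileinput), and it assumes the file really is UTF-8 encoded:

```python
import fileinput


def read_lines_utf8(path):
    """Return the decoded lines of `path` (hypothetical helper, read-only)."""
    # hook_encoded builds an opener that decodes the file as UTF-8
    # instead of relying on the platform's default codec.
    hook = fileinput.hook_encoded('utf-8')
    with fileinput.FileInput(files=path, openhook=hook) as file:
        return list(file)
```

This only covers reading; nothing is written back to the file.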

However, an openhook cannot be combined with inplace=True. In that case, you can open the file in binary mode and decode and encode the lines yourself. For reading, it is enough to pass mode='rb', which yields bytes lines instead of str. Writing is a bit more complicated: print always works with str (it converts whatever it is given to str), so passing bytes to it will not produce the expected output. You can, however, write binary data directly to sys.stdout.buffer, and this works:

import sys
import fileinput

filename = '...'
with fileinput.FileInput(filename, mode='rb', inplace=True, backup='.bak') as file:
    for line in file:
        line = line.decode('utf-8')
        line = line.replace('\t', '◊')
        line = line.replace('⁞', '\t')
        sys.stdout.buffer.write(line.encode('utf-8'))
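
If you would rather not depend on fileinput at all, a similar low-memory effect can probably be had by streaming into a temporary file and swapping it in afterwards. The following is only a sketch under a few assumptions: the file is UTF-8, a temporary file in the same directory is acceptable, and no .bak backup is needed (unlike the code above). The names replace_streaming and TABLE are made up for illustration.

```python
import os
import tempfile

# Both substitutions in one pass. str.translate maps each character
# independently, so a tab produced from '⁞' is not turned into '◊'.
TABLE = str.maketrans({'\t': '◊', '⁞': '\t'})


def replace_streaming(path, table=TABLE, encoding='utf-8'):
    """Rewrite `path` line by line via a temp file (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        # newline='' keeps the original line endings untouched.
        with os.fdopen(fd, 'w', encoding=encoding, newline='') as dst, \
             open(path, encoding=encoding, newline='') as src:
            for line in src:
                dst.write(line.translate(table))
        # Swap the finished temp file in place of the original.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling replace_streaming('file.csv') would then rewrite the file while holding only one line in memory at a time. I have not timed it against the fileinput version.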
jdehesa
  • And it is ~2 times faster (in my case) than the working method I have there in the post. Negligible amount of RAM used. This is fantastic @jdehesa! Thank you! – Sandu Ursu Apr 04 '18 at 09:55