
I have written the following Python code that should convert a file to UTF-8. It works well, but I noticed that if the file is too big (in this case we are talking about a 10 GB file!) the program crashes!

In general it seems to take too much time: 9 minutes to convert 2 GB of text files. Maybe I can make it more efficient? I think it's because I'm first reading the whole file and then saving it; could that be the reason?

  import sys
  import codecs

  filename = sys.argv[1]
  # Read the entire file into memory, decoding from ISO-8859-1
  with codecs.open(filename, 'r', encoding='iso-8859-1') as f:
      text = f.read()
  # Overwrite the same file, re-encoded as UTF-8
  with codecs.open(filename, 'w', encoding='utf8') as f:
      f.write(text)

1 Answer


Yes, this can happen because you're reading the whole file in one go. It's better to read the file in chunks, convert each chunk to UTF-8, and write the chunks to another file as you go.

import sys
import codecs

BLOCKSIZE = 1048576  # roughly 1 MiB per read; pick whatever size you like

sourceFileName = sys.argv[1]
targetFileName = sourceFileName + '-converted'

with codecs.open(sourceFileName, "r", "iso-8859-1") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            # Read one chunk, decoded from ISO-8859-1
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            # Write it back out encoded as UTF-8
            targetFile.write(contents)

I took the code from this question (and modified it a bit).
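As a side note, if you are on Python 3 you don't strictly need the codecs module: the built-in open() takes an encoding argument, and reading in chunks keeps memory use bounded in the same way. Here is a minimal sketch of the same idea (the chunk size and the '-converted' suffix are just illustrative choices, not anything required):

import sys

BLOCKSIZE = 1048576  # characters per chunk; adjust to taste

source_name = sys.argv[1]
target_name = source_name + '-converted'

# newline='' keeps the original line endings instead of translating them
with open(source_name, 'r', encoding='iso-8859-1', newline='') as src, \
     open(target_name, 'w', encoding='utf-8', newline='') as dst:
    while True:
        chunk = src.read(BLOCKSIZE)
        if not chunk:
            break
        dst.write(chunk)

Either way, the key point is the same: never hold the whole 10 GB in memory at once; stream it through in pieces and write to a separate output file rather than overwriting the source in place.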