2

After executing the following code to generate a copy of a text file with Python, the newfile.txt doesn't have the exact same file size as oldfile.txt.

with open('oldfile.txt','r') as a, open('newfile.txt','w') as b:
    content = a.read()
    b.write(content)

While oldfile.txt has e.g. 667 KB, newfile.txt has 681 KB.

Does anyone have an explanation for that?

Jongware
Matthias
  • Is it difficult for you to check two files for differences? There are tools for that. (But my guess is you may find it's End-of-line related.) – Jongware Mar 14 '18 at 09:31
  • [When to open file in binary mode (b)?](https://stackoverflow.com/questions/31483253/when-to-open-file-in-binary-mode-b) answers that, with longer explanations of "text" mode than in the answers here. – Jongware Mar 14 '18 at 09:59
  • Are you a Windows user, right? – Giacomo Catenazzi Mar 14 '18 at 10:05
  • Yes, I generated oldfile.txt in Windows. Its newline characters were `\r\n`, while the newline characters in newfile.txt were `\n`. I see, opening the files in binary mode `with open('oldfile.txt', 'rb') as a, open('newfile.txt', 'wb') as b:`... preserves the newline characters. – Matthias Apr 03 '19 at 20:56

2 Answers

1

There are various causes.

You are opening the files in text mode, so the bytes of the file are decoded into Python strings when read, and then encoded again when written. The bytes can change along the way.

From open documentation (https://docs.python.org/3/library/functions.html#open):

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller.

So if the original file uses Windows line endings (\r\n), the \r is removed when reading. When writing the file back, the original \r is not restored on Linux or macOS, whereas on Windows every \n is written as \r\n (which seems to be your case, because your file increased in size).
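To see the translation the documentation describes, here is a minimal sketch. The helper names (`copy_text_default`, `copy_binary`, `demo`) and the sample content are made up for illustration; the text-mode copy mirrors the code in the question, and the binary copy is the variant mentioned in the comments.

```python
import os
import tempfile


def copy_text_default(src, dst):
    """Copy in text mode with default newline handling, as in the question."""
    with open(src, 'r') as a, open(dst, 'w') as b:
        b.write(a.read())


def copy_binary(src, dst):
    """Copy raw bytes: no decoding and no newline translation."""
    with open(src, 'rb') as a, open(dst, 'wb') as b:
        b.write(a.read())


def demo():
    """Return (decoded text, source size, text-copy size, binary-copy size)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'oldfile.txt')
        text_copy = os.path.join(tmp, 'newfile_text.txt')
        bin_copy = os.path.join(tmp, 'newfile_bin.txt')

        # Sample file with Windows line endings, written as raw bytes.
        with open(src, 'wb') as f:
            f.write(b'line one\r\nline two\r\n')

        # Universal newlines: '\r\n' arrives in Python as '\n'.
        with open(src, 'r') as f:
            text = f.read()

        copy_text_default(src, text_copy)
        copy_binary(src, bin_copy)

        return (text,
                os.path.getsize(src),
                os.path.getsize(text_copy),
                os.path.getsize(bin_copy))


if __name__ == '__main__':
    print(demo())
```

The binary copy always matches the source size. The size of the text-mode copy depends on `os.linesep` of the platform running the script, so it can differ from the source in either direction.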

Encoding can also change the text. For example, a BOM (byte order mark) could be removed (or added), and potentially (though AFAIK this is not done implicitly) redundant code points could be removed: Unicode has some code points that change the behaviour of nearby code points, and if several of them are stacked, only the last one is effective.
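As a rough illustration of the BOM case, the sketch below reads a UTF-8 file that starts with a BOM using the `utf-8-sig` codec and writes it back as plain `utf-8`. The function name `bom_roundtrip_sizes`, the file names, and the explicit encodings are assumptions chosen for the demo; which encoding `open` picks by default depends on your platform and settings.

```python
import codecs
import os
import tempfile


def bom_roundtrip_sizes():
    """Return (decoded text, source size, copy size) for a BOM-prefixed file."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'with_bom.txt')
        dst = os.path.join(tmp, 'without_bom.txt')

        # Source file: 3-byte UTF-8 BOM followed by the encoded text.
        with open(src, 'wb') as f:
            f.write(codecs.BOM_UTF8 + 'h\u00e9llo'.encode('utf-8'))

        # 'utf-8-sig' drops a leading BOM while decoding;
        # newline='' switches off newline translation so only the BOM matters.
        with open(src, 'r', encoding='utf-8-sig', newline='') as a:
            content = a.read()

        # Plain 'utf-8' does not write a BOM.
        with open(dst, 'w', encoding='utf-8', newline='') as b:
            b.write(content)

        return content, os.path.getsize(src), os.path.getsize(dst)


if __name__ == '__main__':
    print(bom_roundtrip_sizes())
```

Writing with `encoding='utf-8-sig'` instead would put the BOM back, so the copy can also end up larger than a source that had none.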

Giacomo Catenazzi
0

I tried this on Linux (Ubuntu). It works as expected: the file sizes of both files are exactly equal.

At this point, I guess this behavior is not related to Python; it may depend on your filesystem (compression) or operating system.
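One way to check that guess is to count the line endings in both files from their raw bytes. This is only a sketch; `count_line_endings` is a hypothetical helper, not something from the standard library.

```python
def count_line_endings(data):
    """Count CRLF, lone LF and lone CR line endings in a bytes object."""
    crlf = data.count(b'\r\n')
    lf = data.count(b'\n') - crlf   # '\n' not preceded by '\r'
    cr = data.count(b'\r') - crlf   # '\r' not followed by '\n'
    return {'crlf': crlf, 'lf': lf, 'cr': cr}


def count_line_endings_in_file(path):
    """Read a file in binary mode so nothing is translated, then count."""
    with open(path, 'rb') as f:
        return count_line_endings(f.read())
```

Calling `count_line_endings_in_file` on oldfile.txt and newfile.txt and comparing the results shows whether the line endings differ between the two. If the counts are identical and the sizes still differ, the cause lies elsewhere.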

r4r3devAut