I'm having trouble understanding why my Python program behaves the way it does when reading the first lines of files and adding the lines to a list. For some reason the first line needs to be empty, or it won't be read correctly. And if the first line is empty, it's not actually empty (at least not according to Python). The thing is, I have two types of files:

First file is in the form:

text:more text
another text:and more

and the second file in the form:

text_file.txt
anothertext_file.txt

Both files are UTF-8 encoded text files. The first item from each file that gets added to a list in my program looks like "text" or "text_file.txt", but any code that, for example, says

if something == "text":
    ...

will not get executed even though "something" appears to be the same as "text".

So I'm assuming that my problem is that somewhere in the machine code (or something), my computer writes some invisible code at the beginning of the text file, and that makes the first line not what it appears to be. Maybe? I have actually found a workaround for the problem simply by adding an empty line to the file and an if clause when reading the file line by line:

if not "." in line:
    ...

and in the other filetype:

if not ":" in line:
    ...

Those if clauses work and my program does what it's supposed to (as long as I always add an empty line to the beginning of the file), but I haven't been able to find a real reason why my program behaves as it does. Also, I would like to avoid this kind of workaround if there's an easier solution that doesn't involve editing all my files and adding if clauses to my code.

Would appreciate any help understanding what's happening here!

Edit: as you people have been asking for my code, here it is:

filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    for line in f:
        filelist.append(line.rstrip("\n"))

This does not work properly. I also tried it the way mdxs suggested,

filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    lines = f.readlines()
for line in lines:
    filelist.append(line.rstrip("\n"))

and this does not work either. The problem only affects the first character of the first line of each file.

Edit2: It seems the problem is a byte order mark (BOM) at the beginning of my text files. After some quick googling I didn't find a way to remove it. I'm creating my files with plain Windows Notepad.

Final edit: Apparently Notepad is not a real text editor. I guess I'll just switch from Notepad to Notepad++ to avoid this problem. However, just in case I have to handle my files in Notepad: if I open a text file in Notepad and add some text to it, will it add a BOM, or does it do that only when creating the file?

vivas
  • Depending on exactly how you read the file, each line may or may not end with a newline character. – John Gordon Jun 17 '16 at 21:41
  • ... and it may begin with a BOM (byte order mark - shiver). – Dilettant Jun 17 '16 at 21:42
  • I would say that you should fix the way your program is reading your files before you look at this. It's possible these two problems are connected. If you post your code, we can help you with that. – kirkpatt Jun 17 '16 at 21:43
  • Possibly use a text editor to analyze hidden or non standard data or characters in the file – chickity china chinese chicken Jun 17 '16 at 21:47
  • @Dilettant after googling what BOM is that is most definitely what my problem is! Now, if only I could figure out what creates the BOM and how to get rid of it as it seems to be an optional character... "BOM use is optional" (wiki) – vivas Jun 18 '16 at 07:48

3 Answers

Looks like you've already done the legwork on this, but according to How to make Notepad to save text in UTF-8 without BOM?, the best answer is not to use Notepad (but Notepad++ is ok). :)

Alternatively, you can strip the BOM in Python with:

# line must be a bytes object here (e.g. read from a file opened in "rb" mode)
line = line.decode("utf-8-sig").encode("utf-8")

See https://docs.python.org/3/library/codecs.html:

To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.

...

On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
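
For the Python 3 code in the question, where the file is opened in text mode, one option is to pass the "utf-8-sig" codec name straight to open(). The following is a minimal sketch, not the asker's actual program: the helper name read_lines and the demo file are assumptions made up for illustration.

```python
import os
import tempfile


def read_lines(path):
    """Read a UTF-8 text file into a list of lines, dropping a leading BOM if present."""
    # "utf-8-sig" skips the bytes EF BB BF at the start of the file when they
    # are there, and otherwise decodes exactly like plain "utf-8".
    with open(path, "r", encoding="utf-8-sig") as f:
        return [line.rstrip("\n") for line in f]


# Demo (hypothetical file): write a BOM followed by the question's sample lines.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "wb") as f:
    f.write(b"\xef\xbb\xbftext:more text\nanother text:and more\n")

filelist = read_lines(demo_path)
print(filelist[0].split(":")[0] == "text")  # True
```

Because the codec also accepts files without a BOM, the same call should work for files saved by Notepad and by editors that do not write one.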

leekaiinthesky

A classic approach to reading text files in Python is:

with open(fname, 'r') as f:
    lines = f.readlines()

After which you can process the lines like this:

for line in lines:
    # do something with line...

As other comments have hinted, you may want to make sure this works first. It would help if you post your current code for review.
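
One way to check what was actually read is to print the first line with repr(), which shows invisible characters as escape sequences. This is only a diagnostic sketch; the demo file and its contents are made up to mimic a file saved with a UTF-8 BOM.

```python
import os
import tempfile

# Hypothetical demo file: a UTF-8 BOM followed by the question's sample lines.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "wb") as f:
    f.write(b"\xef\xbb\xbftext_file.txt\nanothertext_file.txt\n")

with open(demo_path, "r", encoding="utf-8") as f:
    lines = f.readlines()

first = lines[0].rstrip("\n")
print(repr(first))                 # '\ufefftext_file.txt'
print(first == "text_file.txt")    # False: the BOM is part of the string
print(first.startswith("\ufeff"))  # True
```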

mdxs

I just had a similar issue: Python's readlines() reports invalid characters at the start of the first line. I tried every suggestion I could find on Google, with no luck.
I came up with a simple trick: add a blank line as the first line of the text file, then skip it with a length check:

# line is the list returned by readlines()
for i in range(len(line)):
    if len(line[i]) > len(line[0]):
        print(line[i])  # do things with this line
    # otherwise skip it: it is no longer than the "blank" first line

In my case, len(line[0]) is 4, and all other lines are longer than 4.
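
If editing the files is not desirable, another option is to remove the BOM character from the first line after reading. This is a rough sketch under the assumption that the file was decoded as "utf-8", in which case the BOM shows up as the single character "\ufeff"; the function name strip_bom is made up for illustration.

```python
BOM = "\ufeff"  # how the UTF-8 byte order mark appears after decoding as "utf-8"


def strip_bom(lines):
    """Return a copy of lines with a leading BOM removed from the first line."""
    if lines and lines[0].startswith(BOM):
        return [lines[0][len(BOM):]] + lines[1:]
    return list(lines)


# Sample data standing in for lines read from a file that starts with a BOM.
lines = ["\ufefftext:more text", "another text:and more"]
cleaned = strip_bom(lines)
print(cleaned[0] == "text:more text")  # True
```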

Heinz