
I am trying to open a .txt file in Python.

Before flagging this as a duplicate, please take a look at the code and the file below.

I have used this snippet to read similar files before, however this particular batch of files does not work.

location = "sample/sample2/"
filename = location + "Detector_-3000um.txt"
skip = 25  # skip the first 25 lines

The code to open it is -

import numpy as np

f = open(filename)
num_lines = sum(1 for line in f)  # first pass: count the lines
print "Skipping the first " + str(skip) + " lines"
data = np.zeros((num_lines - skip + 1, num_lines - skip + 1))
f.close()
f = open(filename)  # reopen to start reading from the top again
i = 0
for _ in range(skip):  # skip unwanted rows
    next(f)
for line in f:
    data[i, :] = line.split()
    i += 1
f.close()

It's a 501x501 data set with the first row and column being the row and column numbers respectively.
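As suggested in the comments, the two open/close passes can be collapsed into a single `with` block using `f.seek(0)` to rewind. A minimal sketch of that pattern on a small stand-in file (the 3x3 file, two header lines, and tab separator here are assumptions for illustration, not the question's actual data):

```python
import numpy as np
import tempfile, os

skip = 2  # skip the header lines (the question uses 25)

# Build a small stand-in file: 2 header lines + a 3x3 block of numbers
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("header one\nheader two\n")
    for r in range(3):
        tmp.write("\t".join(str(r * 3 + c) for c in range(3)) + "\n")
    path = tmp.name

with open(path) as f:
    num_lines = sum(1 for _ in f)   # first pass: count lines
    f.seek(0)                       # rewind instead of close/reopen
    for _ in range(skip):           # skip unwanted rows
        next(f)
    data = np.zeros((num_lines - skip, num_lines - skip))
    for i, line in enumerate(f):
        data[i, :] = line.split()

os.remove(path)
print(data[2, 2])  # -> 8.0
```

The `with` block guarantees the file is closed even if parsing raises, and `enumerate` replaces the manual `i` counter.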

The data set is attached here.

I also tried pandas, `pd.read_csv(filename, skiprows=skip)`, however it gives this error:

CParserError: Error tokenizing data. C error: Expected 1 fields in line 49, saw 501
shbhuk
  • "Sometimes this code doesn't work" isn't a question. – TigerhawkT3 Mar 30 '17 at 20:46
  • `with open(filename) as f:` coupled with `f.seek(0)` to go back to the start will clean this up considerably. – TemporalWolf Mar 30 '17 at 20:49
  • any specific reason you `f.close()` right before `f=open(filename)` ? – Chris Mar 30 '17 at 20:52
  • f.close() to reset the counter. The initial set was to count the number of lines, and then the next set is to read the file. – shbhuk Mar 30 '17 at 20:57
  • in terms of using pandas I recommend you check out this: http://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data – Chris Mar 30 '17 at 21:07
  • I would consider updating the title since this seems to have more to do with correct parsing of CSV/TSV data than anything to do with skipping lines in a file per se. – Iguananaut Mar 31 '17 at 12:12

1 Answer


I think there is nothing wrong with your code; the problem is the file encoding.

I converted your file's encoding to UTF-8, and then both your code and pandas' read_csv() worked properly:

pd.read_csv(myfile, skiprows=24, header=0, index_col=0, sep='\t')

There are many ways to convert the encoding; for example, use Notepad++ (Windows), the way I did, or see here: How to convert a file to utf-8 in Python?
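The re-encoding can also be done in Python itself. A minimal sketch, assuming the source file is latin-1 (substitute whatever encoding the file actually uses; the stand-in file below is for illustration only):

```python
import tempfile, os

# Stand-in for a file saved in a non-UTF-8 encoding (latin-1 here)
src = tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False)
src.write("Detector reading: 3000 µm\n".encode("latin-1"))
src.close()

# Re-encode: read with the original encoding, write back as UTF-8
with open(src.name, encoding="latin-1") as f:
    text = f.read()
with open(src.name, "w", encoding="utf-8") as f:
    f.write(text)

with open(src.name, encoding="utf-8") as f:
    print(f.read())  # now reads cleanly as UTF-8
os.remove(src.name)
```

After the rewrite, both `open()` and `pd.read_csv()` can consume the file with the default (or an explicit `encoding='utf-8'`) setting.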

xirururu
  • That was it. On conversion the files work. For future reference, what can be an indicator of an encoding issue? Basically what should I look out for? Is there a simple function check? – shbhuk Mar 31 '17 at 01:41
  • @shbhuk I would add a check at the beginning: 1) find out whether the file is "utf-8" encoded; 2) if not, convert it into "utf-8", else continue. I think this should be the easiest way. Here is a post about how to do that: http://stackoverflow.com/questions/6707657/python-detect-charset-and-convert-to-utf-8 – xirururu Mar 31 '17 at 13:13
  • You can use something like [chardet](https://github.com/chardet/chardet) to automate encoding detection. It will usually do the job, but there might be edge cases where you get false results. When you know the correct encoding, you can pass that to the `open` function. `with open(filename, encoding='ascii') as src:` – Håken Lid Mar 31 '17 at 13:33
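A dependency-free variant of the check described in those comments (a sketch; `read_with_fallback` and its candidate list are hypothetical names, and `chardet` would automate the guess): the tell-tale indicator of an encoding issue is a `UnicodeDecodeError` when reading the file.

```python
import tempfile, os

def read_with_fallback(path, encodings=("utf-8", "utf-16", "latin-1")):
    """Try each candidate encoding in order; a UnicodeDecodeError is
    the sign the file is not in that encoding."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

# Stand-in file saved as UTF-16, which a plain UTF-8 read would reject
tmp = tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False)
tmp.write("1\t2\t3\n".encode("utf-16"))
tmp.close()

text, enc = read_with_fallback(tmp.name)
os.remove(tmp.name)
print(enc)  # the UTF-8 attempt fails; the UTF-16 attempt succeeds
```

Note that latin-1 accepts any byte sequence, so it only makes sense as the last fallback; a confident detection still needs something like chardet.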