import sys
dataset = open('file-00.csv','r')
dataset_l = dataset.readlines()

When opening the above file, I get the following error:

**UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 156: invalid start byte**

So I changed the code to the following:

import sys
dataset = open('file-00.csv','r', errors='replace')
dataset_l = dataset.readlines() 

I also tried errors='ignore'. With either option the initial error disappears, but later in my code I get another error:

def find_class_1(row):
    global file_l_sp
    for line in file_l_sp:
        if line[0] == row[2] and line[1] == row[4] and line[2] == row[5]:
            return line[3].strip()
    return 'other'

File "Label_Classify_Dataset.py", line 56, in <module>

dataset_w_label += dataset_l[it].strip() + ',' + find_class_1(l) + ',' + find_class_2(l) + '\n'

File "Label_Classify_Dataset.py", line 40, in find_class_1

if line[0] == row[2] and line[1] == row[4] and line[2] == row[5]:

IndexError: list index out of range

How can I fix either the first or the second error?

UPDATE:

I have enumerated and printed each line of the file and have managed to work out which lines are causing the error. It is indeed a stray character that tshark must have substituted. Deleting the affected lines removes the error, but I would rather have Python skip over them than delete them from the file.

with open('file.csv') as f:
    for i, line in enumerate(f):
        print('{} = {}'.format(i+1, line.strip()))

I'm sure there is a better way to do the enumeration, lol.
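One possible way to skip the bad lines instead of deleting them, sketched under the assumption that the file is mostly valid UTF-8 with a few stray bytes: read the file in binary mode and decode each line separately, dropping the lines that fail. The helper name `read_decodable_lines` and the sample file are hypothetical and not part of the original script.

```python
import os
import tempfile


def read_decodable_lines(path, encoding="utf-8"):
    """Return (good_lines, skipped_line_numbers) for a file read as bytes."""
    good, skipped = [], []
    with open(path, "rb") as f:  # binary mode: nothing is decoded on read
        for lineno, raw in enumerate(f, start=1):  # start=1 replaces i+1
            try:
                good.append(raw.decode(encoding))
            except UnicodeDecodeError:
                skipped.append(lineno)  # remember which lines were dropped
    return good, skipped


# Made-up sample: line 2 contains 0xfe, which is never valid in UTF-8.
sample = b"a,b,c\n1,\xfe,3\nx,y,z\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(sample)

lines, skipped = read_decodable_lines(tmp.name)
os.remove(tmp.name)
print(lines)    # ['a,b,c\n', 'x,y,z\n']
print(skipped)  # [2]
```

Keeping the skipped line numbers makes it easy to check afterwards how much of the dataset was dropped. Passing `start=1` to `enumerate` is also a slightly tidier way to get 1-based line numbers than `i+1`.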

Bat
  • Try to open the file with 'rb', like `dataset = open('file-00.csv','rb')` – Raunaq Jain Aug 27 '18 at 11:29
  • Without seeing the data it's quite hard to guess. What encoding does it really have? – Thomas Weller Aug 27 '18 at 11:29
  • Don't ignore encoding errors. Open the file with the right encoding. Obviously `utf8` is not the right encoding. Also, don't use `.readlines()` and `.split()` for CSV files, use the `csv` module. Thirdly, avoid global variables. They are not necessary for what you do here. – Tomalak Aug 27 '18 at 11:29
  • @RaunaqJain thanks I tried that see comment below for new error lol – Bat Aug 27 '18 at 12:30
  • @ThomasWeller the data should be utf8 as it was a pcap file which was converted to csv using LibreOffice with utf8 specified – Bat Aug 27 '18 at 12:31
  • @Tomalak ah OK, I think I tried this approach initially but wasn't getting anywhere, so I changed to using readlines. – Bat Aug 27 '18 at 12:32
  • Sorry everyone, I'm new to coding, so I'm pretty sure my code isn't very good (clearly), lol – Bat Aug 27 '18 at 12:33
  • Wireshark PCAP files likely contain binary data. Not all binary data can simply be put into a UTF-8 file. Maybe you should be looking into how to open a PCAP file directly instead of converting it into a (potentially invalid) different format. Like https://stackoverflow.com/questions/4948043/how-to-parse-packets-in-a-python-library I definitely wonder about opening PCAP files with LibreOffice. It's just not the right tool. – Thomas Weller Aug 27 '18 at 13:28
  • @ThomasWeller OK, thanks. I used tshark to output directly to a .csv file (one of its built-in export methods). I have printed each line while enumerating through the file and have managed to find the lines with the problem. I have deleted just these (0.2% of the dataset) and the error has disappeared, but it would be great to leave them in the dataset and have Python skip them – Bat Aug 27 '18 at 14:18
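A sketch of what the `csv`-module suggestion in the comments above might look like, with a length check that would avoid the `IndexError` when a row has fewer columns than expected. The column positions (2, 4, 5) are taken from the question's `find_class_1`. The `rules` argument (replacing the global `file_l_sp`), the changed signature, and the sample data are assumptions made for illustration.

```python
import csv
import io


def find_class_1(row, rules):
    """Return the label of the first rule matching columns 2, 4 and 5 of row."""
    if len(row) < 6:  # short or damaged row: row[5] would raise IndexError
        return "other"
    for rule in rules:
        if len(rule) < 4:  # skip malformed rule lines as well
            continue
        if rule[0] == row[2] and rule[1] == row[4] and rule[2] == row[5]:
            return rule[3].strip()
    return "other"


# Hypothetical in-memory data standing in for the two files.
rules = list(csv.reader(io.StringIO("10.0.0.1,80,tcp,web\n")))
data = io.StringIO("1,0.1,10.0.0.1,x,80,tcp\n2,0.2,broken\n")

# Append the label as a new last column of each row.
labelled = [row + [find_class_1(row, rules)] for row in csv.reader(data)]
print(labelled)
```

With a real file, the `csv` documentation recommends opening it with `newline=''` and passing the file object to `csv.reader`.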

1 Answer


Try the following:

dataset = open('file-00.csv','rb')

The b in the mode specifier of open() means the file is treated as binary, so its contents remain bytes and no decoding is attempted.
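To illustrate what binary mode implies, here is a small sketch: lines come back as `bytes`, so they have to be decoded before they can be joined with `str` values. The sample bytes, the `",label"` suffix, and the choice of `errors="replace"` are assumptions for illustration only.

```python
raw_line = b"1,\xfe,3\n"  # what a line read with mode 'rb' can look like

try:
    raw_line + ",label"  # mixing bytes and str is not allowed
except TypeError as exc:
    print("TypeError:", exc)

# Decode explicitly; errors="replace" swaps each bad byte for U+FFFD.
text_line = raw_line.decode("utf-8", errors="replace")
labelled = text_line.strip() + ",label"
print(labelled)
```

Whether to replace, ignore, or skip bad bytes depends on how much the damaged fields matter for the later classification step.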

Nordle
  • OK, thanks. I tried 'rb', but this then causes a later error, because a new column (a string) is added once a row has been classified, so I get: TypeError: can't concat str to bytes – Bat Aug 27 '18 at 12:29