
Edit: using utf-16 seems to get me closer in the right direction, but I have CSV values that include commas, such as "one example value is a description, which is long and can include commas, and quotes"

So with my current code:

filepath="csv_input/frups.csv"

rows = []
with open(filepath, encoding='utf-16') as f:
    for line in f:
        print('line=',line)
        formatted_line=line.strip().split(",")
        print('formatted_line=',formatted_line)
        rows.append(formatted_line)
        print('')

Lines get formatted incorrectly:


line= "FRUPS"   "11111112"        "Paahou 11111112, 11111112,11111112"    "Bar, Achal"      "Iagress"   "Unassigned"    "Normal"        "GaWu , Suaair center will not be able to repair 3 couch part 11111112, 11111112,11111112 . Pleasa to repair .

formatted_line= ['"FRUPS"\t"11111112"\t"Parts not able to repair in Suzhou 11111112', ' 11111112', '11111112"\t"Baaaaaar', ' Acaaaal"\t"In Progress"\t"Unassigned"\t"Normal"\t"Got coaow Wu ', ' Suar cat 11111112', ' 11111112', '11111112. Pleasa to repair .']

line= 11111112

formatted_line= ['11111112']

So in this example the fields are separated by what look like long spaces (tabs, judging by the \t in formatted_line), and splitting on commas is not a reliable way to read the data line by line.
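A minimal sketch of the same loop splitting on tabs instead of commas (assuming the delimiter really is \t, as the output above suggests; the quotes around each value are left in place here):

rows = []
with open(filepath, encoding='utf-16') as f:
    for line in f:
        # split on tabs instead of commas, so values containing commas stay intact
        rows.append(line.strip().split('\t'))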


I am trying to read a CSV line by line in Python, but each solution leads to a different error.

  1. Using pandas:
filepath="csv_input/frups.csv"
data = pd.read_csv(filepath, encoding='utf-16')
for thing in data:
    print(thing)
    print('')

Fails to read the file, with the error Error tokenizing data. C error: Expected 7 fields in line 16, saw 8

  2. Using csv.reader:
from csv import reader

# open file in read mode
with open(filepath, 'r') as read_obj:
    # pass the file object to reader() to get the reader object
    csv_reader = reader(read_obj)
    # Iterate over each row in the csv using reader object
    for row in csv_reader:
        # row variable is a list that represents a row in csv
        print(row)

Fails at the for row in csv_reader line with the error line contains NUL

I've tried to figure out what these NUL characters are, but trying to investigate using code leads to different errors:

data = open(filepath, 'rb').read()
print(data.find('\x00'))

error: argument should be integer or bytes-like object, not 'str'
  3. Another read solution, trying to strip certain characters:

with open(filepath,'rb') as f:
    contents = f.read()
contents = contents.rstrip("\n").decode("utf-16")
contents = contents.split("\r\n")

error: TypeError: a bytes-like object is required, not 'str'

It seems like my CSV has some weird characters that cause Python to error out. I can open and view my CSV just fine in Excel, so how can I read my CSV line by line? Such as:

row[0]=['col1','col2','col3']
row[1]=['val1','val2','val3']
etc...
POVR2
  • It seems your CSV has a variable number of columns, leading to the pandas error. In #2, you didn't open in utf-16 mode and got a NUL - so add the utf-16 encoding there. You could count the lines as you read them, find the first with an extra column and know where the problem is. On #3 you opened in binary mode, but you should have used a bytes string `b"\x00"` on find (see the sketch after this comment thread). – tdelaney Jul 21 '22 at 19:26
  • Does this CSV file have a header? You are getting it from excel..., is the table always a consistent number of rows? Pandas will assume a column count from the first line in the csv. – tdelaney Jul 21 '22 at 19:29
  • @tdelaney I added `with open(filepath, encoding='utf-16')` to #2, which will read data, but that data is separated incorrectly: sometimes it prints out a single value saying it's a line, because my lines contain values with commas. – POVR2 Jul 21 '22 at 19:34
  • @tdelaney for #3, I'm not sure what you mean. Should I change `contents = contents.split("\r\n")` to `contents = contents.split(b"\x00")` or something? I'm not sure if my csv has a header. And I believe it is a consistent number of rows? There are 3701 rows, with consistent values for each column, I think is what you're asking. – POVR2 Jul 21 '22 at 19:35
  • You could open in binary and read lines until you find a few lines representative of the problem. Post those, still in bytes format, and we will have a better feel for the format. – tdelaney Jul 21 '22 at 19:51
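
A minimal sketch of the checks tdelaney describes (searching the raw bytes for NUL with a bytes pattern, and counting fields per line to find the first row with an extra column) might look like this; splitting on \t is an assumption based on the output shown in the question:

filepath = "csv_input/frups.csv"

# search the raw bytes for NUL using a bytes pattern, not a str
data = open(filepath, 'rb').read()
print(data.find(b'\x00'))  # -1 would mean there are no NUL bytes at all

# count fields per line to locate the first row with an unexpected column count
with open(filepath, encoding='utf-16') as f:
    for lineno, line in enumerate(f, start=1):
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 7:  # 7 is the expected count from the pandas error
            print(lineno, len(fields))
            break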

2 Answers


You can always read the file manually to build such a structure:

rows = []
with open(filepath) as f:
    for line in f:
        rows.append(line.strip().split(","))
Mateo Torres
  • Do I need to do something to decode the values? When I run that code, I get results in my `rows` object that look like `[['ÿþ"\x00T\x00y\x00p\x00e\x00"\x00\t\x00"\x00I\x00d\x00"\x00\t\x00"\x00S\x00u\x00m\x00m\x00a\x00r\x00y\x00"\x00\t\x00"\x00O\x00w\x00n\x00e\x00d\x00 \x00B\x00y\x00"\x00\t\x00"\x00S\x00t\x00a\x00t\x00u\x00s\x00"\x00\t\x00"\x00P\x00r\x00i\x00o\x00r\x00i\x00t\x00y\x00"\x00\t\x00"\x00S\x00e\x00v\x00e\x00r\x00i\` – POVR2 Jul 21 '22 at 19:13
  • It seems that the file you're dealing with is encoded in `utf-16`, you might want to have a look at [this answer](https://stackoverflow.com/questions/2144815/how-to-know-the-encoding-of-a-file-in-python) (a minimal byte-order-mark check is sketched after this comment thread) – Mateo Torres Jul 21 '22 at 19:15
  • I updated my question with my current status: using utf-16 works better, but some of my CSV values have commas, so separating by commas is not reliable. But my Excel view of the CSV is perfect, so there's got to be some way to replicate that view, right? – POVR2 Jul 21 '22 at 19:26
  • If there are embedded commas - are they enclosed in quotes? excel should have escaped them properly. – tdelaney Jul 21 '22 at 19:27
  • Yes, they should all be enclosed in double quotes. If I open my csv in Visual Studio, I can see lines separated like this ` "Hlosek, Nancy" "In Progress" "Unassigned" `, but sometimes if there are blank columns they appear like ` "" "" "(HC) CT FRU" "" "" `, without consistent spacing between each value. – POVR2 Jul 21 '22 at 19:39
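
The `ÿþ` prefix in the bytes shown above is `b'\xff\xfe'`, the UTF-16 little-endian byte-order mark. A minimal sketch of a BOM check (with utf-8 as an assumed fallback when no BOM is present):

import codecs

with open(filepath, 'rb') as f:
    start = f.read(2)

# b'\xff\xfe' (shown as "ÿþ") marks UTF-16 LE; b'\xfe\xff' marks UTF-16 BE
if start in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
    encoding = 'utf-16'
else:
    encoding = 'utf-8'  # assumption: fall back to utf-8 if no BOM is found

print(encoding)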
0

What you have shown as line and formatted_line is a hint that:

  • your file is utf-16 encoded
  • it uses tabs (\t) as delimiters

So you should use:

  1. with the csv module:

     from csv import reader

     # open file in read mode
     with open(filepath, 'r', encoding='utf-16') as read_obj:
         # pass the file object to reader() to get the reader object
         csv_reader = reader(read_obj, delimiter='\t')
         # Iterate over each row in the csv using reader object
         for row in csv_reader:
             # row variable is a list that represents a row in csv
             print(row)
    
  2. with Pandas:

     import pandas as pd

     data = pd.read_csv(filepath, encoding='utf-16', sep='\t')
     for thing in data:
         print(thing)
         print('')
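
Note that iterating over a DataFrame directly (for thing in data) yields the column names, not the rows. To get a structure like row[0]=['col1','col2','col3'] from the question, one possible sketch is:

import pandas as pd

data = pd.read_csv(filepath, encoding='utf-16', sep='\t')
rows = [list(data.columns)] + data.values.tolist()
print(rows[0])  # header row, e.g. ['col1', 'col2', 'col3']
print(rows[1])  # first data row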
    
Serge Ballesta