
I have a CSV file whose data I'm trying to convert into a Dense Design Matrix using the Pylearn2 module. However, the error is not related to Pylearn2 but to my own implementation.

import csv
import sys

import numpy as np


def load_data(fileName_X, start, stop):
    # start and stop are accepted but not used yet
    with open(fileName_X, 'r') as f:
        reader_X = csv.reader(f, delimiter=';')
        X = []
        header = True
        size = 0
        for row_X in reader_X:
            # skip the first row, which is assumed to be a header
            if header:
                header = False
                continue
            row_X = [float(elem_X) for elem_X in row_X]
            size += 1
            X.append(row_X)
        X = np.asarray(X)
    return X

if __name__ == "__main__":
    train = load_data(sys.argv[1], 0, 10)

Here are some rows of the .csv file:

-1;-0.844511;-0.339286;-1;0.0769231;-0.25;-0.929825;1;1;-0.880597;1;0;-0.92;-0.99
1;-0.796992;-0.8275;-1;0.0769231;-0.25;-0.859649;1;1;-0.671642;1;-1;-0.8;-0.94
1;-0.611429;-0.875;-1;0.0769231;-0.25;-0.929825;1;1;-0.850746;1;-1;-0.84;-0.88446
-1;-0.661654;-0.119286;-1;0.846154;0.75;-0.754386;1;1;-0.820896;-1;-1;-0.6;-0.99084

I expect row_X to contain all the values of each row, split on the ; separator. But when I run the program, I get this error:

Traceback (most recent call last):
  File "make_dataset.py", line 68, in <module>
    train = load_data(sys.argv[1], 0, 10)
  File "make_dataset.py", line 53, in load_data
    row_X = [float(elem_X) for elem_X in row_X]
ValueError: invalid literal for float(): 1;-0.796992;-0.8275;-1;0.0769231;-0.25;-0.859649;1;1;-0.671642;1;-1;-0.8;-0.94

What I can't figure out is why the program works properly with another CSV file, which contains these rows (shown here to compare its format with the first one):

7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
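
For readers following the traceback, here is a minimal sketch of what the error message implies: `float()` received a whole line as one string instead of a single field. The `line` value is a shortened copy of a row from the first file, and the manual `split` is only there for illustration; it does not explain why `csv.reader` failed to split the row.

```python
# A shortened copy of one row from the first file (illustration only).
line = "1;-0.796992;-0.8275;-1"

# Passing the unsplit line to float() fails, as in the traceback.
try:
    float(line)
except ValueError as exc:
    print("whole line:", exc)

# Converting field by field succeeds once the line has been split.
values = [float(field) for field in line.split(";")]
print("split fields:", values)
```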
  • Did you read this? http://stackoverflow.com/questions/14145460/python-convert-negative-decimals-from-string-to-float – FirebladeDan Jul 16 '15 at 19:46
  • A quick aside: To skip a header row just do `next(reader_X)` before your for-loop. – Steven Rumbalski Jul 16 '15 at 19:46
  • This doesn't directly address your bug, but you [might want to use](http://stackoverflow.com/questions/4315506/load-csv-into-2d-matrix-with-numpy-for-plotting) [`numpy.loadtxt`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html#numpy.loadtxt) – Steven Rumbalski Jul 16 '15 at 19:51
  • @FirebladeDan I've just checked in IPython whether the problem is caused by the negative sign in some rows. Unfortunately this is not the case: _float()_ works properly even with negative signs. As for spaces, I'm sure there are no spaces in any of the rows. – Sina Jul 16 '15 at 19:52
  • It's trying to parse the entire line into a single number. You need to split it up and parse each element in the line. – TigerhawkT3 Jul 16 '15 at 19:55
  • Perfect idea @StevenRumbalski. I'll try that if this could not be solved by tonight ;) – Sina Jul 16 '15 at 19:55
  • From the [Python docs](https://docs.python.org/2/library/csv.html): *If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.* Don't know if that's the problem, but it's worth a shot. – Mark Ransom Jul 16 '15 at 19:56
  • @TigerhawkT3: You are correct, but he's using `csv.reader` which should be splitting the line correctly. – Steven Rumbalski Jul 16 '15 at 19:56
  • @MarkRansom: It's important to note that the binary flag is needed in Python 2 only. So if OP is using Python 2, he should try this. In Python 3 the docs change to "If csvfile is a file object, it should be opened with `newline=''`." – Steven Rumbalski Jul 16 '15 at 19:58
  • This is a critical point. What platform/version of Python are you running? – rabbit Jul 16 '15 at 19:58
  • @StevenRumbalski it's not just Python 2, but Python 2 on Windows - Linux qualifies as a platform where 'b' doesn't make a difference. – Mark Ransom Jul 16 '15 at 20:00
  • @NathanBartley Python 2.7.10 (if the OS is important, it's Debian Linux). I'm just trying to check whether any of the responses solve the problem. Thanks to all. – Sina Jul 16 '15 at 20:01
  • That's very strange because I ran your code & bad input verbatim on OSX/2.7.2 and could not replicate your problem. – rabbit Jul 16 '15 at 20:03
  • This code (with the example data) works with both Python 2 and Python 3 for me. Since you seem to not get the error on the first row but only with the second, can you add a `print(repr(row_X))` before the float line to see what Python sees? – poke Jul 16 '15 at 20:07
  • @poke: I believe the reason he sees the error on row 2 is that he is skipping a non-existent header row. – Steven Rumbalski Jul 16 '15 at 20:10
  • I've tried adding a header to the csv file (the first one, which produces the error). It seems that the problem is caused by the second row, whatever it is: `ValueError: invalid literal for float(): -1;-0.844511;-0.339286;-1;0.0769231;-0.25;-0.929825;1;1;-0.880597;1;0;-0.92;-0.99`. What is still unclear to me is why the second csv file doesn't produce any error when I run the program. (You can take a closer look at the second dataset here: [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Wine+Quality). Click on the _Data Folder_ link.) – Sina Jul 16 '15 at 20:23
  • Instead of using `csv`, have you tried using `numpy.fromtxt(f, delimiter=';')` after skipping the header? – codewarrior Jul 17 '15 at 00:25
  • @codewarrior I was able to work around the problem for now using `reader_X = np.genfromtxt(fileName_X, delimiter=';')`. But the problem with csv still remains. – Sina Jul 17 '15 at 14:46
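
Two suggestions from the comments can be combined into a small diagnostic: skip the header with `next()`, and report the `repr` of any row that fails to convert. This is a minimal sketch using only the standard library. The function name `read_rows` and the `has_header` flag are hypothetical names introduced for this example. The quoted-line input at the end is an assumption that shows one way a row can reach `float()` unsplit; the thread does not establish that this is what happened with the original file.

```python
import csv
import io


def read_rows(lines, delimiter=";", has_header=False):
    """Parse delimited text into lists of floats.

    `lines` is any iterable of strings, for example an open file.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    if has_header:
        next(reader, None)  # discard the header row, if there is one
    rows = []
    for row in reader:
        try:
            rows.append([float(field) for field in row])
        except ValueError:
            # Report where the failure happened and exactly what was read.
            raise ValueError(
                "line %d has %d field(s): %r" % (reader.line_num, len(row), row)
            )
    return rows


# Rows shaped like the ones in the question parse without error.
good = io.StringIO("-1;-0.844511;-0.339286\n1;-0.796992;-0.8275\n")
print(read_rows(good))

# Assumed scenario: a line wrapped in double quotes is read as ONE field,
# because the csv module treats the quoted text as a single value.
quoted = io.StringIO('"1;-0.796992;-0.8275"\n')
try:
    read_rows(quoted)
except ValueError as exc:
    print(exc)
```

Passing `has_header=True` only when the file really has a header avoids silently dropping the first data row, which is the behaviour Steven Rumbalski points out above.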
