
I am reading a dataset and want every element of each row except the last one in train, with the last element as target. Building target works and I can print it, but when the code reaches train = ... I get this error: IndexError: invalid index

dataset = np.genfromtxt(open(train_file,'r'), delimiter=',',dtype=None)[1:]
target = [x[401] for x in dataset]
train = [x[0:400] for x in dataset]

I also tried: [x[:-1] for x in dataset] but I get the same error.

The dataset is big, but this is a sample row:

xxx,-0.011451,-0.070532,...,-0.011451,-0.070532,O

sage88
Nick
  • What's wrong with `dataset[:-1]`? – Avinash Raj Apr 17 '15 at 23:50
  • Because I want the first 401 elements of every item in dataset. Dataset is an array of lists. – Nick Apr 17 '15 at 23:55
  • could it be that when you've gotten to `train` you're at the end of the file, so `x` is `None`? – abcd Apr 17 '15 at 23:57
  • what is `genfromtxt`? – abcd Apr 17 '15 at 23:57
  • You haven't provided any information about `dataset`, `genfromtxt()`, or `train_file`. Any answers will just be guesses, trying to bruteforce the solution. – TigerhawkT3 Apr 17 '15 at 23:58
  • Perhaps it would help if you found a different wording instead of "once hit the train will pop out this error" because that makes absolutely zero sense. You might try a longer code sample and a traceback. – Paul Cornelius Apr 18 '15 at 00:01
  • @TigerhawkT3 [genfromtxt()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html) is part of NumPy. – Tutleman Apr 18 '15 at 00:12
  • I see that now... after it was mentioned in a comment on an answer. That sort of information should be present in the question's text and/or a tag to encourage useful answers. – TigerhawkT3 Apr 18 '15 at 00:18
  • Yes. You are right. My apology. – Nick Apr 18 '15 at 00:19
  • It would be strange that target would work and train wouldn't, but have you checked this just to see if the lengths are all as expected? `for x in dataset: len(x)` – sage88 Apr 18 '15 at 00:31
  • Yes. all 402 columns. – Nick Apr 18 '15 at 00:32
  • Can you get it to work using numpy access syntax: `train = dataset[0:len(dataset)][0:400]` – sage88 Apr 18 '15 at 00:41
  • Also, just so you know, you have an off by 1 error in train. It should be: `train = [x[0:401] for x in dataset]` – sage88 Apr 18 '15 at 00:58
  • What is `dataset`'s shape? – wwii Apr 18 '15 at 04:29

2 Answers


Your issue appears to be with understanding how list comprehensions work, and when you might want to use one.

A list comprehension goes through every item in a list, applies an expression to it, and can optionally filter out some elements. For instance, if I had the following list:

digits = [1, 2, 3, 4, 5, 6, 7]

And I used the following list comprehension:

squares = [i * i for i in digits]

I would get: [1, 4, 9, 16, 25, 36, 49]

I could also do something like this:

even_squares = [i * i for i in digits if i % 2 == 0]

Which would give me: [4, 16, 36]

Now let's talk about your list comprehensions in particular. You wrote [x[401] for x in dataset], which, in English, reads as "a list containing the element at index 401 (the 402nd element) of each item in the list called dataset".

Now, if any line of your dataset has fewer than 402 items, index 401 does not exist for that line, and trying to access it raises an IndexError.

It sounds like you're just trying to get all the elements in dataset excluding the last one. To do that, you can use Python's slice notation. If you write dataset[:-1], you'll get all items in the dataset other than the last one. Similarly, if you wrote dataset[:-2], you'd get all items except for the last two, and so on. The same works if you want to cut off the front of the list: dataset[1:-1] will give you all items in the list excluding the 0th and last items.
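To make the difference between slicing the dataset and slicing each row concrete, here is a minimal sketch using plain Python lists. The three-row `dataset` below is made-up sample data for illustration, not the real file.

```python
# Hypothetical sample: each row is [id, two features, label]
dataset = [
    [0, -0.011451, -0.070532, "O"],
    [1, 0.5, 0.25, "X"],
    [2, 0.1, 0.2, "O"],
]

# Slicing the outer list drops the last ROW
all_but_last_row = dataset[:-1]

# Slicing inside a comprehension drops the last ELEMENT of every row
train = [row[:-1] for row in dataset]

# Negative indexing picks the last element of every row
target = [row[-1] for row in dataset]

print(len(all_but_last_row))  # number of rows kept
print(train[0])
print(target)
```

Note that this only behaves this way when each row is a real sequence such as a list or tuple.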

Edit: Now that I see the new comments on your post, it's clear that you are trying to get the first 401 elements of each item in the dataset. Unfortunately, because we don't know anything about your dataset, it's impossible to say what exactly the issue is.

Tutleman
  • You got me wrong. I have 402 items in each line. I can pass and get `[x[401] for x in dataset]` with no problem. When I do `dataset[:-1]`, I get that error. – Nick Apr 18 '15 at 00:05
  • @Nick Can you give us some more information? A good place to start would be the rest of the error message. – Tutleman Apr 18 '15 at 00:07
  • There is no more to the error message. It prints this error and exits. I guess it might have something to do with numpy. I also added a sample of my data, but the full file is too big to copy here. – Nick Apr 18 '15 at 00:09
  • @Nick Is it possible that your columns have different dtypes? That could cause this issue. – Tutleman Apr 18 '15 at 00:13
  • Well, my columns are all numbers except the last one. The first one is an integer (0 or 1 or 2), the next 400 are negative and positive float numbers and the last one is a string. – Nick Apr 18 '15 at 00:37
  • @Nick I bet that's your problem. See [this question](http://stackoverflow.com/questions/7093431/numpy-array-column-slicing-produces-indexerror-invalid-index-exception). – Tutleman Apr 18 '15 at 04:44
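The mixed-dtype explanation in the comments above can be sketched as follows. This is an illustration under assumptions, not a confirmed diagnosis of the original file: the two-row `csv_text` is invented, and the exact error message depends on the NumPy version. With `dtype=None` and columns of different types, `np.genfromtxt` returns a one-dimensional structured array, so each `x` in the comprehension is a single record rather than a row array. An integer index such as `x[401]` selects one field, while a slice such as `x[0:400]` is rejected.

```python
import io
import numpy as np

# Made-up two-row sample: an integer, two floats, then a string label
csv_text = "0,-0.011451,-0.070532,O\n1,0.5,0.25,X\n"

# dtype=None picks a type per column; mixed types give a structured array
mixed = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                      dtype=None, encoding="utf-8")
print(mixed.shape)        # one-dimensional: one record per row
print(mixed.dtype.names)  # auto-generated field names

# An integer index selects one field of a record
last_label = mixed[0][3]

# A slice on a record is not supported and raises IndexError
try:
    mixed[0][0:3]
    slice_failed = False
except IndexError as exc:
    slice_failed = True
    print("slice failed:", exc)

# One possible workaround: convert records to plain tuples, then slice
rows = mixed.tolist()
train = [row[:-1] for row in rows]
target = [row[-1] for row in rows]
```

Converting with `tolist()` is only one option. Reading the numeric columns and the label column separately would keep the features as a regular two-dimensional float array.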

I just tested this with the following toy code. Your syntax is actually correct. Something is wrong with your input file, not with the way you are selecting elements from your list of arrays.

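import numpy as np

# One row of 402 values (1..402), standing in for a line of the dataset
a = np.arange(1, 403)

# Five identical rows
dataset = [a for _ in range(5)]

target = [x[401] for x in dataset]   # element at index 401 (last column)
train = [x[0:400] for x in dataset]  # first 400 columns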
sage88
  • The strange thing is that when I test data with `dataset = list(csv.reader(open(train_file, 'rU')))` I don't get any error. It is strange. I have another dataset that works well with np.genfromtxt – Nick Apr 18 '15 at 00:31