Subsampling pandas dataframe with float index - ValueError

Question

This question describes exactly what I want to do and this answer works perfectly on my example data. However, at some point I run into a problem when working with my real, much larger dataset. In my real dataset, I would like to subsample to every hundredth point. Currently, the index goes like 259.05, 259.06, 259.07, 259.08, 259.09, 259.1, 259.11, 259.12, 259.13, 259.14... and I would like to subsample it just to 259, 260, 261... But I would like to start at some reasonable numbers, such as 260 or at least 259.5.

However, when I get to the point as suggested in the abovementioned answer, the following code works:

s = (df.index.to_series()).astype(int)
df.groupby(s).mean().set_index(s.index[13::100])

This produces 259.18, 260.18, 261.18, ... But if I start at any higher offset, e.g. `df.groupby(s).mean().set_index(s.index[14::100])`, I get: ValueError: Length mismatch: Expected axis has 635 elements, new values have 634 elements

Long story short: Input:

index   some data
259.05  x
259.06  x
259.07  x
259.08  x
259.09  x
259.1   x
259.11  x
259.12  x
259.13  x
259.14  x
259.15  x
…   …

Desired output:

index   some data
260     mean x
261     mean x
262     mean x
263     mean x
264     mean x
265     mean x
266     mean x
267     mean x
268     mean x
269     mean x
270     mean x
…   …

Apparently this is because the length of the data is not sufficient for another full 100. So how can I achieve having it sampled at the desired points?
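A minimal sketch of the mismatch, assuming a hypothetical 314-row toy frame with the same 0.01 index spacing (not the real dataset). The grouped result has one row per integer value, while `s.index[k::100]` yields one label per step of 100 that still fits, so the two counts stop agreeing once `k` gets large enough:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 314 rows, float index 259.05, 259.06, ..., 262.18
idx = np.round(259.05 + 0.01 * np.arange(314), 2)
df = pd.DataFrame({"x": np.arange(314, dtype=float)}, index=idx)

# Integer part of each index value: 259, 260, 261, 262
s = df.index.to_series().astype(int)
grouped = df.groupby(s).mean()
n_groups = len(grouped)             # one row per integer value -> 4

labels_13 = s.index[13::100]        # positions 13, 113, 213, 313 -> 4 labels
labels_14 = s.index[14::100]        # positions 14, 114, 214      -> 3 labels

ok = grouped.set_index(labels_13)   # lengths agree, so this works
try:
    grouped.set_index(labels_14)    # 4 rows but only 3 labels
    raised = False
except ValueError:
    raised = True

print(n_groups, len(labels_13), len(labels_14), raised)
```

In this toy case the number of rows in the grouped frame is fixed by the data, so any offset that makes the slice one label shorter triggers the error.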

What is the size of your dataset? Using `[14::100]` goes in steps of 100, so it may just be that you reached the point where it can't take another full step of 100. — BrenBarn, Dec 15 '16 at 03:24
It is true that the length is not divisible by 100. But from that I would assume that it would work for indexes that are divisible by 100 and not work for those that aren't. But this works for everything between 0-13, so there are multiple points where it can't take another full 100 but it still works... or am I missing something? — durbachit, Dec 15 '16 at 05:26
It doesn't matter whether it's divisible by 100. What matters is how many hundreds of elements are left in the rest of the list. Imagine your list has 314 elements (indexed 0 to 313). Then doing `[1::100]` will get elements 1, 101, 201, 301. `[2::100]` will get 2, 102, 202, 302. This will work until you get to 14, at which point `[14::100]` will return only 14, 114, 214, because there is no element 314. I can't say for sure if this is the cause of the problem you're describing, but it sounds like it may be. — BrenBarn, Dec 15 '16 at 05:33
Oh, I see what you mean. But I would be completely happy with that, I want the solution that has the 14, 114 and 214 and I am happy to chop off the last 99 elements like I chopped off the first 14. Why does it throw the error when it has to discard 99 values but it's ok with discarding 1 or 12 values (when starting at numbers 0-13)? Oh I see, it's when we reach the point when we have fewer elements, that's what the error implies. You are right, it definitely is the reason. But I still don't know what to do about it. I don't mind having fewer elements. How can I change its expectations? — durbachit, Dec 15 '16 at 05:59
I guess you could add another slice on the end to chop it off, but I don't really understand what you're trying to do. It's unclear how your code snippet relates to your description of the problem. It would help if you could provide a simple, self-contained example including sample data. — BrenBarn, Dec 15 '16 at 06:06
What does "mean x" mean? Do you mean that you want the data in the result row "260" to be the mean of all data in the input that was between 260 and 261? — BrenBarn, Dec 15 '16 at 06:59
In that case why not just use `groupby` to group by the integer value? — BrenBarn, Dec 15 '16 at 07:29
Well, I would prefer a more robust solution, which would work for any sampling (e.g. if I wanted to sample by 0.5 and not a whole integer). But for this particular case, it should work. If you add it as an answer, I will gladly accept it ;) — durbachit, Dec 16 '16 at 02:56
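
The `groupby` suggestion in the comments can be sketched roughly as follows. This is only an illustration under assumptions: `bin_mean`, `width` and `start` are hypothetical names, and the toy frame stands in for the real data. Each row is labelled with the lower edge of its bin, so the result is indexed by 260, 261, ... directly and no `set_index` with a sliced index is needed. The same idea covers a 0.5 step.

```python
import numpy as np
import pandas as pd

def bin_mean(df, width=1.0, start=None):
    """Average rows into bins of `width` along a numeric index (hypothetical helper)."""
    if start is not None:
        df = df[df.index >= start]      # drop rows before the first wanted bin
    # Label each row with the lower edge of the bin it falls into
    edges = np.floor(df.index.to_numpy() / width) * width
    return df.groupby(edges).mean()

# Hypothetical toy data: 314 rows, float index 259.05, 259.06, ..., 262.18
idx = np.round(259.05 + 0.01 * np.arange(314), 2)
df = pd.DataFrame({"x": np.arange(314, dtype=float)}, index=idx)

by_one = bin_mean(df, width=1.0, start=260)     # bins 260, 261, 262
by_half = bin_mean(df, width=0.5, start=259.5)  # bins 259.5, 260.0, ..., 262.0
print(by_one)
```

Note that the last bin (262 here) is averaged over fewer rows than the others because the data ends partway through it; whether to keep or drop such a partial bin is a separate choice.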
