I agree with @EdChum that this would be simplest given a Series object named s:
d = pd.DataFrame(s.values.reshape(1000000, -1))
which would reshape your Series into a DataFrame of shape (1,000,000, len(s) / 1,000,000).
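As a quick sanity check, here's a minimal sketch of that reshape on a toy Series of length 12 split into 4-row chunks (the sizes 12 and 4 here are just for illustration):

import numpy as np
import pandas as pd

s = pd.Series(np.arange(12))
# reshape to 4 rows; -1 lets numpy infer the 3 columns (12 / 4)
d = pd.DataFrame(s.values.reshape(4, -1))
print(d.shape)   # (4, 3)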
However, the above only works if you have a Series whose length is an exact multiple of 1,000,000. Alternatively, you could do something like:
# note with python3, you need to use integer division // here
s.index = pd.MultiIndex.from_tuples([(x/1000000,x%1000000) for x in s.index])
# or an alternative below which does the same thing
#s.index = pd.MultiIndex.from_tuples(s.index.map(lambda x: (x/1000000, x%1000000)))
s.unstack(0)
which will give you several columns of the same length, with the last column padded with NaNs.
Here's an example with a Series of length 55 which I want split into columns of length 10. Note the last column has the last 5 values set to NaN:
In [42]: s = pd.Series(np.arange(55))
In [43]: s
Out[43]:
0      0
1      1
2      2
...
53    53
54    54
dtype: int64
# with python3 x//10, x%10
In [44]: s.index = pd.MultiIndex.from_tuples(s.index.map(lambda x: (x/10, x%10)))
In [45]: s.unstack(0)
Out[45]:
    0   1   2   3   4    5
0   0  10  20  30  40   50
1   1  11  21  31  41   51
2   2  12  22  32  42   52
3   3  13  23  33  43   53
4   4  14  24  34  44   54
5   5  15  25  35  45  NaN
6   6  16  26  36  46  NaN
7   7  17  27  37  47  NaN
8   8  18  28  38  48  NaN
9   9  19  29  39  49  NaN
Note two things:

1. Using s.index.map(lambda ...) should be faster than the list comprehension for very large arrays.
2. If using Python 3, make sure to use integer division in the lambda function: lambda x: (x // N, x % N), as in the sketch below.
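For reference, here's a minimal self-contained sketch of the same example written for Python 3, where N is the desired column length (10 here, matching the example above):

import numpy as np
import pandas as pd

N = 10  # desired column length
s = pd.Series(np.arange(55))

# Python 3: // keeps the first index level as integers rather than floats
s.index = pd.MultiIndex.from_tuples(s.index.map(lambda x: (x // N, x % N)))
print(s.unstack(0))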