1

I have a text file with my data, formatted as a single list. The data is actually a number of rows and columns, but the format is as a single column. I have imported this into a pandas dataframe, and I would like to reshape this dataframe.

This is the format of the data as a list:

a1
b1
c1
d1
e1
a2
b2
c2
d2
e2
a3
b3
c3
d3
e3
etc...

The desired format is:

"Heading 1"    "Heading 2"    "Heading 3"    "Heading 4"    "Heading 5" 
a1             b1             c1             d1             e1
a2             b2             c2             d2             e2
a3             b3             c3             d3             e3

I have tried pandas stack and unstack functions, but no luck. I had also tried using a numpy array, but my data has numbers and strings in it, so this does not work well.

1 Answers1

1

You can create list of tuples first and pass to DataFrame constructor:

L = ['a1', 1, 'c1', 'd1', 'e1', 'a2', 2, 'c2', 'd2', 'e2', 'a3', 3, 'c3', 'd3', 'e3']

import itertools

#https://stackoverflow.com/a/1625013
def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

print (list(grouper(5, L)))
[('a1', 1, 'c1', 'd1', 'e1'), ('a2', 2, 'c2', 'd2', 'e2'), ('a3', 3, 'c3', 'd3', 'e3')]

df = pd.DataFrame(list(grouper(5, L))).rename(columns = lambda x: f'Heading {x + 1}')
print (df)
  Heading 1  Heading 2 Heading 3 Heading 4 Heading 5
0        a1          1        c1        d1        e1
1        a2          2        c2        d2        e2
2        a3          3        c3        d3        e3

print (df.dtypes)
Heading 1    object
Heading 2     int64
Heading 3    object
Heading 4    object
Heading 5    object
dtype: object

First idea with reshape, but last is necessary convert column to numeric:

df = pd.DataFrame(np.array(L).reshape(-1, 5)).rename(columns = lambda x: f'Heading {x + 1}')
print (df)
  Heading 1 Heading 2 Heading 3 Heading 4 Heading 5
0        a1         1        c1        d1        e1
1        a2         2        c2        d2        e2
2        a3         3        c3        d3        e3

print (df.dtypes)

Heading 1    object
Heading 2    object <- converted to object
Heading 3    object
Heading 4    object
Heading 5    object
dtype: object
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks a lot, that does the job! Could you describe a bit what these two lines do, and why they are necessary? `code: args = [iter(iterable)] * n ` and `code: return itertools.zip_longest(*args, fillvalue=fillvalue) ` – Bernard Esterhuyse Oct 14 '19 at 08:20
  • @BernardE - Hard question, because not author of function. But second `itertools.zip_longest(*args, fillvalue=fillvalue)` means add `fillvalue` if number of values modulo by `5` is not 0 - here `15 % 5` is `0` so working nice. But if number of value is `14` or `13` for last or last 2 values are added `fillvalue` - here `None` – jezrael Oct 14 '19 at 08:33