
TL;DR: iterate through the dataframe to pull each column name, then its value, into a new dataframe.

I have a long dataframe in the format

0 column 1
1 value
2 column 2
3 value
4 column 3
5 value

Repeating for 32 columns over about 100,000 rows.

I would like to iterate over this data frame to create a new data frame with:

Column 1 Column 2 Column 3
Value    Value     Value

I have tried converting to a dictionary to enable matching on the keys but have failed.
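
For reference, a small frame in the same alternating layout can be built like this (the names and values below are placeholders, not the real data):

import pandas as pd

# one column of alternating "column name" / "value" rows
df = pd.DataFrame({'col': ['column 1', 'value',
                           'column 2', 'value',
                           'column 3', 'value']})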

dm2

3 Answers


Assuming "col" your column, use simple slicing:

# even rows (0, 2, 4, …) hold the column names, odd rows hold the values
cols = df.loc[df.index[::2], 'col'].to_numpy()
vals = df.loc[df.index[1::2], 'col'].to_numpy()

out = pd.DataFrame([vals], columns=cols)

Output:

  column 1 column 2 column 3
0    value    value    value

duplicated column names

Considering an alternative example:

        col
0  column 1
1   value 1
2  column 2
3   value 2
4  column 2  # this name already exists
5   value 3

If you have duplicated column names, then a pivot might be useful:

mask = np.arange(len(df)) % 2 == 0   # True on the rows that hold column names

out = (df[mask]
 .assign(values=df['col'].shift(-1),                 # value sits in the row just below the name
         idx=lambda d: d.groupby('col').cumcount()   # occurrence counter per column name
         )
 .pivot(index='idx', columns='col', values='values')
 .rename_axis(index=None, columns=None)
)

Output:

  column 1 column 2
0  value 1  value 2
1      NaN  value 3

using reshaping

If the column names are always in a logical order (1, 2, 3,…, 32, 1, 2, 3, …), then reshape would be a good alternative:

# synthetic example: 100,000 alternating name/value rows with 32 repeating column names
df = pd.DataFrame({'col': [f'{c}{i%32+1}' if c=='column' else f'{c}{i+1}'
                           for i in range(50_000) for c in ['column', 'value']]})

N = 32  # number of distinct columns

a = df['col'].to_numpy()
values = a[1::2]  # odd positions hold the values

# pad the tail up to the next multiple of N so it reshapes into full rows
out = pd.DataFrame(np.pad(values, (0, -len(values) % N),
                          constant_values=np.nan).reshape((-1, N)),
                   columns=a[:2*N:2])
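
As a quick sanity check on the synthetic example above (just a sketch, assuming it runs as written), the result should have 32 columns and the padded number of rows:

print(out.shape)              # (1563, 32): 50,000 values padded into 1563 full rows of 32
print(list(out.columns[:3]))  # ['column1', 'column2', 'column3']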

NB: thank you @Nick for pointing out the likely repetition of column names.

mozway
  • With 100,000 rows there will be some repetition of column names, and the values will need slicing into multiple lists – Nick May 09 '23 at 07:29
  • @Nick that's a possibility (the OP should clarify), I added an alternative to handle such a case – mozway May 09 '23 at 10:05
  • OP does actually say, "Repeating for 32 columns over about 100,000 rows." which I presume to mean there are about 1500 sets of values for each of the 32 columns. – Nick May 09 '23 at 10:06
  • @Nick Then a `reshape` would be even better, if the pattern is consistent. (That's why it's always annoying not to have a full example… always a waste of time!) – mozway May 09 '23 at 10:08
  • Indeed, reshape could work very well for this problem. – Nick May 09 '23 at 10:09

Since you have 100,000 rows, there will be repetition of column names, and the values will need to be grouped into rows. You can extract the unique column names, then chunk the values according to how many column names there are. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [
    'column 1', 'value 1', 'column 2', 'value 2', 'column 3', 'value 3',
    'column 1', 'value 4', 'column 2', 'value 5', 'column 3', 'value 6']
})

cols = np.unique(df['col'].iloc[::2])
values = [list(df['col'].iloc[1::2][i:i+len(cols)]) for i in range(0, len(df) // 2, len(cols))]
out = pd.DataFrame(values, columns=cols)

Sample output:

  column 1 column 2 column 3
0  value 1  value 2  value 3
1  value 4  value 5  value 6

Note I've chosen a simplistic way to chunk the values for the purposes of the sample code. For best performance you will probably want to use one of the methods described in this Q&A.
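
For instance (just one possibility, not necessarily the approach in that Q&A), if the number of value rows is an exact multiple of the number of column names, a NumPy reshape avoids the Python-level loop:

# assumes the same df as above and that the names repeat in a consistent order
cols = df['col'].iloc[::2].unique()   # first-occurrence order (np.unique would sort)
vals = df['col'].iloc[1::2].to_numpy().reshape(-1, len(cols))
out = pd.DataFrame(vals, columns=cols)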

Nick
  • The OP needs `Repeating for 32 columns over about 100,000 rows.` – jezrael May 09 '23 at 08:06
  • @jezrael this answer will work for any number of columns over any number of rows. I just chose a smaller example to demonstrate – Nick May 09 '23 at 08:07

With 100,000 rows there will be some repetition of column names and the values will need slicing into multiple lists, so a 32-column-wide table may not be the best way to show the data.

Repeating for 32 columns over about 100,000 rows.

I would like to iterate over this data frame to create a new data frame with:

Column 1 Column 2 Column 3
Value    Value     Value
I have tried converting to a dictionary to enable matching on the keys but have failed

Maybe you can use `iloc[::2]` indexing to step through the values, then use `zip` to pair each column name with its value, and convert the result to a dataframe.

Here is the demo code:

import io
import pandas as pd

df_str = '''
_id  _val
0 column1
1 value
2 column2
3 value
4 column3
5 value
'''
df = pd.read_csv(io.StringIO(df_str.strip()), sep=r'\s+')
print(df)

# then: pair each name row (::2) with the value row below it (1::2)
dfn = pd.DataFrame(zip(df['_val'].iloc[::2], df['_val'].iloc[1::2]), columns=['col_name', 'value'])
print(dfn)

Output:

   _id     _val
0    0  column1
1    1    value
2    2  column2
3    3    value
4    4  column3
5    5    value

  col_name  value
0  column1  value
1  column2  value
2  column3  value
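
If the goal is the wide layout from the question (one row per repeated set of columns), a possible follow-up step, sketched here rather than taken from the answer, is to number the repeats of each name and pivot:

# hypothetical continuation: number repeated column names, then pivot to wide format
dfn['row'] = dfn.groupby('col_name').cumcount()
wide = dfn.pivot(index='row', columns='col_name', values='value')
print(wide)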
Ferris
  • With 100,000 rows there will be some repetition of column names, and the values will need slicing into multiple lists – Nick May 09 '23 at 07:29