
TL;DR: iterate through the dataframe to pull each column name, then its value, into a new dataframe.

I have a long dataframe in the format

0 column 1
1 value
2 column 2
3 value
4 column 3
5 value

Repeating for 32 columns over about 100,000 rows.

I would like to iterate over this data frame to create a new data frame with:

Column 1 Column 2 Column 3
Value    Value     Value

I have tried converting to a dictionary to enable matching on the keys but have failed.
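
For reference, a small frame in the same alternating layout can be built like this (the names and values below are placeholders, not the real data):

import pandas as pd

# one column of alternating "column name" / "value" rows
df = pd.DataFrame({'col': ['column 1', 'value',
                           'column 2', 'value',
                           'column 3', 'value']})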

dm2

3 Answers


Assuming "col" your column, use simple slicing:

# even rows (0, 2, 4, …) hold the column names, odd rows hold the values
cols = df.loc[df.index[::2], 'col'].to_numpy()
vals = df.loc[df.index[1::2], 'col'].to_numpy()

out = pd.DataFrame([vals], columns=cols)

Output:

  column 1 column 2 column 3
0    value    value    value

duplicated column names

Considering an alternative example:

        col
0  column 1
1   value 1
2  column 2
3   value 2
4  column 2  # this name already exists
5   value 3

If you have duplicated column names, then a pivot might be useful:

mask = np.arange(len(df)) % 2 == 0   # True on the rows that hold column names

out = (df[mask]
 .assign(values=df['col'].shift(-1),                 # value sits in the row just below the name
         idx=lambda d: d.groupby('col').cumcount()   # occurrence counter per column name
         )
 .pivot(index='idx', columns='col', values='values')
 .rename_axis(index=None, columns=None)
)

Output:

  column 1 column 2
0  value 1  value 2
1      NaN  value 3

using reshaping

If the column names are always in a logical order (1, 2, 3,…, 32, 1, 2, 3, …), then reshape would be a good alternative:

# synthetic example: 100,000 alternating name/value rows with 32 repeating column names
df = pd.DataFrame({'col': [f'{c}{i%32+1}' if c=='column' else f'{c}{i+1}'
                           for i in range(50_000) for c in ['column', 'value']]})

N = 32  # number of distinct columns

a = df['col'].to_numpy()
values = a[1::2]  # odd positions hold the values

# pad the tail up to the next multiple of N so it reshapes into full rows
out = pd.DataFrame(np.pad(values, (0, -len(values) % N),
                          constant_values=np.nan).reshape((-1, N)),
                   columns=a[:2*N:2])
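
As a quick sanity check on the synthetic example above (just a sketch, assuming it runs as written), the result should have 32 columns and the padded number of rows:

print(out.shape)              # (1563, 32): 50,000 values padded into 1563 full rows of 32
print(list(out.columns[:3]))  # ['column1', 'column2', 'column3']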

NB: thank you @Nick for pointing out the likely repetition of column names.

mozway
  • With 100,000 rows there will be some repetition of column names, and the values will need slicing into multiple lists – Nick May 09 '23 at 07:29
  • @Nick that's a possibility (the OP should clarify), I added an alternative to handle such a case – mozway May 09 '23 at 10:05
  • OP does actually say, "Repeating for 32 columns over about 100,000 rows." which I presume to mean there are about 1500 sets of values for each of the 32 columns. – Nick May 09 '23 at 10:06
  • @Nick Then a `reshape` would be even better, if the pattern is consistent. (That's why it's always annoying not to have a full example… always a waste of time!) – mozway May 09 '23 at 10:08
  • Indeed, reshape could work very well for this problem. – Nick May 09 '23 at 10:09

Since you have 100,000 rows, there will be repetition of column names, and the values will need to be grouped into rows. You can extract the unique column names, then chunk the values according to how many column names there are. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [
    'column 1', 'value 1', 'column 2', 'value 2', 'column 3', 'value 3',
    'column 1', 'value 4', 'column 2', 'value 5', 'column 3', 'value 6']
})

cols = np.unique(df['col'].iloc[::2])
values = [list(df['col'].iloc[1::2][i:i+len(cols)]) for i in range(0, len(df) // 2, len(cols))]
out = pd.DataFrame(values, columns=cols)

Sample output:

  column 1 column 2 column 3
0  value 1  value 2  value 3
1  value 4  value 5  value 6

Note I've chosen a simplistic way to chunk the values for the purposes of the sample code. For best performance you will probably want to use one of the methods described in this Q&A.
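
For instance (just one possibility, not necessarily the approach in that Q&A), if the number of value rows is an exact multiple of the number of column names, a NumPy reshape avoids the Python-level loop:

# assumes the same df as above and that the names repeat in a consistent order
cols = df['col'].iloc[::2].unique()   # first-occurrence order (np.unique would sort)
vals = df['col'].iloc[1::2].to_numpy().reshape(-1, len(cols))
out = pd.DataFrame(vals, columns=cols)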

Nick
  • The OP needs `Repeating for 32 columns over about 100,000 rows.` – jezrael May 09 '23 at 08:06
  • @jezrael this answer will work for any number of columns over any number of rows. I just chose a smaller example to demonstrate – Nick May 09 '23 at 08:07

With 100,000 rows there will be some repetition of column names and the values will need slicing into multiple lists, so a 32-column-wide table may not be the best way to show the data.

Repeating for 32 columns over about 100,000 rows.

I would like to iterate over this data frame to create a new data frame with:

Column 1 Column 2 Column 3
Value    Value     Value
I have tried converting to a dictionary to enable matching on the keys but have failed

Maybe you can use `iloc[::2]` indexing to step through the values, then use `zip` to pair each column name with its value, and convert the result to a dataframe.

Here is the demo code:

import io
import pandas as pd

df_str = '''
_id  _val
0 column1
1 value
2 column2
3 value
4 column3
5 value
'''
df = pd.read_csv(io.StringIO(df_str.strip()), sep=r'\s+')
print(df)

# then: pair each name row (::2) with the value row below it (1::2)
dfn = pd.DataFrame(zip(df['_val'].iloc[::2], df['_val'].iloc[1::2]), columns=['col_name', 'value'])
print(dfn)

Output:

   _id     _val
0    0  column1
1    1    value
2    2  column2
3    3    value
4    4  column3
5    5    value

  col_name  value
0  column1  value
1  column2  value
2  column3  value
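
If the goal is the wide layout from the question (one row per repeated set of columns), a possible follow-up step, sketched here rather than taken from the answer, is to number the repeats of each name and pivot:

# hypothetical continuation: number repeated column names, then pivot to wide format
dfn['row'] = dfn.groupby('col_name').cumcount()
wide = dfn.pivot(index='row', columns='col_name', values='value')
print(wide)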
Ferris
  • With 100,000 rows there will be some repetition of column names, and the values will need slicing into multiple lists – Nick May 09 '23 at 07:29