2

I've just started using Python and I'm stuck with a problem related to a dataset I'm working with.

I have the following dataset:

    C1  C2  C3          C4      C5      C6
    99  069 99002068    3348117 3230802 T6
    99  069 99002063    4599974 178885  T4
    99  069 99002063    4599974 4606066 T4
    99  069 99002063    4599974 236346  T4
    99  069 99002063    4599974 310114  T4

I need to group by transpose column C5 into multiple columns based on a group by of columns C1,C2,C3,C4,C6.

The code I've written so far is the following:

    # load plugins
    import pandas as pd

    # import CSV
    data = pd.read_csv(
        "C:/Users/mcatuogno/Desktop/lista_collegamenti_onb.csv",
        sep=";",
        header=None,
        dtype=str,
        usecols=[0, 1, 2, 3, 4, 5],
        names=["C1", "C2", "C3", "C4", "C5", "C6"]
    )

    # sort values
    dataSort = data.sort_values(["C1", "C2", "C3", "C4"])

    # transpose column based on group by function
    dataTranspose = dataSort.groupby(["C1", "C2", "C3", "C4", "C6"])["C5"].apply(list)

With the code above the result is

    C1   C2  ...              C6      C5
    99  000  ...  09900000001100      [102995, 102997, 102996]
    99  000  ...  09900000001135      [103042]

I don't know how I can split the column C5 into multiple columns, each with the following name CN_1, CN_2, ..., CN_x.

Which python function can I use?

Thanks in advance!

  • Possible duplicate of [Pandas split column of lists into multiple columns](https://stackoverflow.com/questions/35491274/pandas-split-column-of-lists-into-multiple-columns) – Zaraki Kenpachi Jul 10 '19 at 13:01

1 Answers1

2

You can create helper Series for count consecutive values per groups by GroupBy.cumcount, add to MultiIndex and reshape by Series.unstack:

g = dataSort.groupby(["C1", "C2", "C3", "C4", "C6"])["C5"].cumcount()
print (g)
1    0
2    1
3    2
4    3
0    0
dtype: int64

df = (dataSort.set_index(["C1", "C2", "C3", "C4", "C6", g])['C5']
              .unstack()
              .add_prefix('Cn_')
              .reset_index())
print (df)
   C1  C2        C3       C4  C6       Cn_0       Cn_1      Cn_2      Cn_3
0  99  69  99002063  4599974  T4   178885.0  4606066.0  236346.0  310114.0
1  99  69  99002068  3348117  T6  3230802.0        NaN       NaN       NaN

Your solution should be changed for create new DataFrame per constructor:

dataTranspose = dataSort.groupby(["C1", "C2", "C3", "C4", "C6"])["C5"].apply(list)

df = (pd.DataFrame(dataTranspose.values.tolist(), index = dataTranspose.index)
        .add_prefix('Cn_')
        .reset_index())
print (df)
   C1  C2        C3       C4  C6     Cn_0       Cn_1      Cn_2      Cn_3
0  99  69  99002063  4599974  T4   178885  4606066.0  236346.0  310114.0
1  99  69  99002068  3348117  T6  3230802        NaN       NaN       NaN
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252