Use a duplicated value as the column name of a new dataframe, with the corresponding values from another column as its rows

Question

To explain what I mean, let's use the following example:

------------------------
|A     |  B      |  C  |
------------------------
|JAVA  |    2    |  1  |
------------------------
|JAVA  |    40   |  22 |
------------------------
|JAVA  |    40   |  52 |
------------------------
|JAVA  |    22   |  7  |
------------------------
|PYT   |    7    |  99 |
------------------------
|C++   |    3    |  5  |
------------------------

The goal is to obtain something like this:

|JAVA  |
--------
|2     |
--------
|40    |
--------
|40    |
--------
|22    |
--------

In other words, I want to take the value that is duplicated in one column and use it as the column name of a new dataframe, whose rows are the corresponding values from another column of the original dataframe. I hope that is clear. If anyone can help with this using Python, I would appreciate it. Thanks.

score 0 · Accepted Answer · answered Jan 24 '21 at 04:35

You can iterate over the rows with DataFrame.iterrows() (which is a generator) and build a dict whose keys are the values of column A and whose values are lists of the corresponding values from column B.

I think you need something like this:

from collections import defaultdict
import pandas as pd

original_columns = {
    'A': ["JAVA", "JAVA", "JAVA", "JAVA", "PYT", "C++"],
    'B': ["2", "40", "40", "22", "7", "3"],
    'C': ["1", "22", "52", "7", "99", "5"]
}

original_data_frame = pd.DataFrame(original_columns, columns=["A", "B", "C"])

# Map each value of column A to the list of B values found in its rows.
new_columns = defaultdict(list)
for index, each_row in original_data_frame.iterrows():
    a_value = each_row["A"]
    b_value = each_row["B"]
    new_columns[a_value].append(b_value)

print(dict(new_columns))
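
Row-by-row iteration with iterrows() tends to be slow on large dataframes, because pandas builds a Series object for every row. As a rough sketch of an alternative (not part of the original answer), the same dict can be built with groupby, which avoids the explicit Python loop over rows. The variable names df and grouped are chosen here for illustration only.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    "A": ["JAVA", "JAVA", "JAVA", "JAVA", "PYT", "C++"],
    "B": ["2", "40", "40", "22", "7", "3"],
    "C": ["1", "22", "52", "7", "99", "5"],
})

# Collect the B values of each A group into a list.
# sort=False keeps the groups in order of first appearance.
grouped = df.groupby("A", sort=False)["B"].agg(list).to_dict()

print(grouped)
# {'JAVA': ['2', '40', '40', '22'], 'PYT': ['7'], 'C++': ['3']}
```

How much faster this is depends on the data, so it is worth timing both versions on a realistic sample.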

Some credit goes to @waitingkuo, @carlos-mougan and this question: https://stackoverflow.com/users/1426056/waitingkuo

The code works, but on a large dataframe it's the worst code ever. - Salxprog, Jan 24 '21 at 06:30
@Salxprog: what do you mean by "worst code ever"? Did you get any errors? Update your post or say so in a comment; I will check it. - DRPK, Jan 24 '21 at 06:36
By bad code I mean that execution takes a long time because it iterates over every row. - Salxprog, Jan 24 '21 at 18:51

score 0 · Answer 2 · answered Jan 24 '21 at 07:27

You can filter the rows with duplicate values and change the column name to the value in column A:

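from pyspark.sql import functions as F, Window

# Assumes `df` is an existing Spark DataFrame with columns A, B and C.
# Count how often each value of A occurs and keep only the repeated ones.
dup = (df.withColumn('count', F.count('A').over(Window.partitionBy('A')))
         .filter('count > 1'))

# Take the column name from the filtered rows, so that it does not depend
# on which value happens to come first in `df`.
df2 = dup.select(F.col('B').alias(dup.select('A').head()[0]))

df2.show()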
+----+
|JAVA|
+----+
|   2|
|  40|
|  40|
|  22|
+----+
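
For readers working in pandas rather than PySpark, the same filter-and-rename idea might look roughly like the sketch below. It is not part of the original answer. It assumes that only one value of A is duplicated, as in the example, and the names dup_rows, name and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["JAVA", "JAVA", "JAVA", "JAVA", "PYT", "C++"],
    "B": [2, 40, 40, 22, 7, 3],
    "C": [1, 22, 52, 7, 99, 5],
})

# keep=False marks every occurrence of a repeated value, not only the later ones.
dup_rows = df[df["A"].duplicated(keep=False)]

# Assumption: a single value of A is duplicated, so the first one names the column.
name = dup_rows["A"].iloc[0]

# Keep only column B, rename it after the duplicated value, and renumber the rows.
result = dup_rows[["B"]].rename(columns={"B": name}).reset_index(drop=True)

print(result)
```

If several values of A are duplicated, this would put all of their B values under the first name, so a groupby per value would be the safer choice in that case.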
