To explain what I mean, let's use the following example:

------------------------
|A     |  B      |  C  |
------------------------
|JAVA  |    2    |  1  |
------------------------
|JAVA  |    40   |  22 |
------------------------
|JAVA  |    40   |  52 |
------------------------
|JAVA  |    22   |  7  |
------------------------
|PYT   |    7    |  99 |
------------------------
|C++   |    3    |  5  |
------------------------

The goal is to obtain something like this:

|JAVA  |
--------
|2     |
--------
|40    |
--------
|40    |
--------
|22    |
--------

In other words, I want to take the value that is duplicated in one column and use it as the column name of a new dataframe, whose values are the corresponding values from another column in the same rows of the old dataframe. I hope that is clear. If anyone can help with this in Python, I would appreciate it. Thanks.

blackbishop
Salxprog

2 Answers

You can iterate over the rows with DataFrame.iterrows() (which is a generator) and pick out the columns you need from each row. Then build a dict whose keys are the values of column A and whose values are lists of the corresponding values from column B.

I think you need something like this:

from collections import defaultdict
import pandas as pd

original_columns = {
    'A': ["JAVA", "JAVA", "JAVA", "JAVA", "PYT", "C++"],
    'B': ["2", "40", "40", "22", "7", "3"],
    'C': ["1", "22", "52", "7", "99", "5"]
}

original_data_frame = pd.DataFrame(original_columns, columns=["A", "B", "C"])
new_columns = defaultdict(list)
# Walk the rows and collect each B value under its A value.
for index, each_row in original_data_frame.iterrows():
    a_value = each_row["A"]
    b_value = each_row["B"]
    new_columns[a_value].append(b_value)

print(dict(new_columns))

Credit to @waitingkuo and @carlos-mougan, and to this question: https://stackoverflow.com/users/1426056/waitingkuo
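If the row-by-row loop becomes slow on a large dataframe, a groupby may do the same grouping without an explicit Python loop over rows. This is only a minimal sketch, assuming the same columns A and B as above; the names `df` and `grouped` are made up for illustration.

```python
import pandas as pd

# Same sample data as above, limited to the two columns that matter here.
df = pd.DataFrame({
    "A": ["JAVA", "JAVA", "JAVA", "JAVA", "PYT", "C++"],
    "B": ["2", "40", "40", "22", "7", "3"],
})

# Group the B values by their A value and collect each group into a list.
# sort=False keeps the groups in order of first appearance.
grouped = df.groupby("A", sort=False)["B"].agg(list).to_dict()

print(grouped)
```

The result has the same shape as the dict built by the loop: one key per distinct value in A, each mapped to the list of its B values.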

DRPK
  • The code worked well, but on a large dataframe it is the worst code ever. – Salxprog Jan 24 '21 at 06:30
  • @Salxprog: what do you mean by "worst code ever"? Did you get any errors? Update your post or say so in a comment and I will check it. – DRPK Jan 24 '21 at 06:36
  • By bad code I mean that execution takes a long time because it iterates over each row. – Salxprog Jan 24 '21 at 18:51

You can filter the rows with duplicate values and change the column name to the value in column A:

from pyspark.sql import functions as F, Window

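# Count the rows per value of A and keep only values that occur more than once.
dup = (df.withColumn('count', F.count('A').over(Window.partitionBy('A')))
         .filter('count > 1'))

# Name the output column after the duplicated value
# (assumes only one value of A is duplicated).
df2 = dup.select(F.col('B').alias(dup.select('A').head()[0]))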

df2.show()
+----+
|JAVA|
+----+
|   2|
|  40|
|  40|
|  22|
+----+
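For readers working in pandas rather than PySpark, the same idea can be sketched roughly as follows. This is an illustration under assumptions, not part of the PySpark solution: the sample frame is rebuilt from the question, and `dups` and `result` are hypothetical names.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": ["JAVA", "JAVA", "JAVA", "JAVA", "PYT", "C++"],
    "B": [2, 40, 40, 22, 7, 3],
    "C": [1, 22, 52, 7, 99, 5],
})

# Keep only the rows whose A value occurs more than once.
dups = df[df["A"].duplicated(keep=False)]

# Build one output column per duplicated value of A, named after that value
# and filled with the matching B values.
result = pd.DataFrame({
    name: group["B"].reset_index(drop=True)
    for name, group in dups.groupby("A", sort=False)
})

print(result)
```

If several values of A are duplicated, each gets its own column, and shorter columns are padded with NaN.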
mck