2

is there any way to create/fill columns with pyspark 2.1.0 where the name of the column is the value of a different column? I tried the following

def createNewColumnsFromValues(dataFrame, colName, targetColName):
  """
  Set value of column colName to targetColName's value
  """
  cols = dataFrame.columns
  #df = dataFrame.withColumn(f.col(colName), f.col(targetColName))
  df = dataFrame.withColumn('x', f.col(targetColName))
  return df

The out commented line does not work, when calling the method I get the error

TypeError: 'Column' object is not callable

whereas the fixed name (as a string) is no problem. Any idea of how to also make the name of the column come from another one, not just the value? I also tried to use a UDF function definition as a workaround with the same no success result.

Thanks for help!

Edit:

from pyspark.sql import functions as f
aynber
  • 22,380
  • 8
  • 50
  • 63
gilgamash
  • 862
  • 10
  • 31
  • Can you also provide the part with the call of the function? I guess it also depends on the arguments you pass to the function, since passing a simple string as `colName` should be possible. The commented line could also be rewritten as `...withColumn(colName, f.col(targetCol))` since colName should contain the name of the new column – gaw Sep 13 '18 at 10:32
  • Hi gaw. colName is a string, using it as the first parameter of withColumn would be of no help, as I need the respective entry of the column "colName", not a fix name, which colName would be. Using colName would basically be similar to the not commented line. – gilgamash Sep 13 '18 at 10:40
  • It would be very helpful if you provided a [small reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) and your desired output. – pault Sep 13 '18 at 14:00

2 Answers2

0

You might want to try the following code:

test_df = spark.createDataFrame([
    (1,"2",5,1),(3,"4",7,8),              
    ], ("col1","col2","col3","col4"))

def createNewColumnsFromValues(dataFrame, sourceCol, colName, targetCol):
    """
    Set value column colName to targetCol
    """
    for value in sourceCol:
        dataFrame = dataFrame.withColumn(str(value[0]), when(col(colName)==value[0], targetCol).otherwise(None))

    return dataFrame

createNewColumnsFromValues(test_df, test_df.select("col4").collect(), "col4", test_df["col3"]).show()   

The trick here is to do select("COLUMNNAME").collect() to get a list of the values in the column. Then colName contains this list, which is a list of rows, where each row has a single element. So you can directly iterate through the list and access the element at position 0. In this case a cast to string was necessary to ensure the column name of the new column is a string. The target column is used for the values for each of the individual columns. So the result would look like:

+----+----+----+----+----+----+
|col1|col2|col3|col4|   1|   8|
+----+----+----+----+----+----+
|   1|   2|   5|   1|   5|null|
|   3|   4|   7|   8|null|   7|
+----+----+----+----+----+----+
gaw
  • 1,960
  • 2
  • 14
  • 18
  • Hi again and thanks for your reply. The problem with this approach is, that each created column gets the value from targetCol, which is not suitable in my situation, where the columns not referenced in the name should be NONE. However, I should have mentioned that in my question. – gilgamash Sep 13 '18 at 11:53
  • I will edit my answer and now it should better match your question. In this case you have to give the initial columnName as an argument and modify the `withColumn` command to only take the value from the target, if the current value matches the correct item in the corresponding line of the sourceColumn – gaw Sep 13 '18 at 11:57
  • Thx again, I will give it a shot and try putting things at least into UDFs, as scalability seems to be an issue. – gilgamash Sep 13 '18 at 12:23
  • Well, I guess this is just not applicable given the scalability problems. This needs to work well on a cluster an on a table with several (hundreds of) millions of rows. Thanks nevertheless! – gilgamash Sep 13 '18 at 12:32
  • 1
    I used the problem meanwhile! Becomes scalable when not iterating over the rows but over the distinct entries I want columns for, as those are not too many. – gilgamash Sep 13 '18 at 13:47
0

I figured a solution which scales nicely for few (or not many) distinct values I need columns for. Which is necessarily the case or the number of columns would explode.

def createNewColumnsFromValues(dataFrame, colName, targetCol):
  distinctValues = dataFrame.select(colName).distinct().collect()
  for value in distinctValues:
    dataFrame = dataFrame.withColumn(str(value[0]), f.when(f.col(colName) == value[0], f.col(targetCol)).otherwise(f.lit(None)))

return dataFrame
gilgamash
  • 862
  • 10
  • 31