
In my Spark application I have a dataframe with information like

+------------------+---------------+
|      labels      | labels_values |
+------------------+---------------+
| ['l1','l2','l3'] |           000 |
| ['l3','l4','l5'] |           100 |
+------------------+---------------+

What I am trying to achieve is, given a label name as input, to create a single_label_value column that takes the value for that label from the labels_values column.

For example, for label='l3' I would like to retrieve this output:

+------------------+---------------+--------------------+
|      labels      | labels_values | single_label_value |
+------------------+---------------+--------------------+
| ['l1','l2','l3'] |           000 |                  0 |
| ['l3','l4','l5'] |           100 |                  1 |
+------------------+---------------+--------------------+

Here's what I am attempting to use:

from pyspark.sql import functions as F

selected_label = 'l3'
label_position = F.array_position(my_df.labels, selected_label)
my_df = my_df.withColumn(
    "single_label_value",
    F.substring(my_df.labels_values, label_position, 1)
)

But I am getting an error, because the substring function expects plain integers for its pos and len arguments and does not accept a Column like label_position.

Is there any way to combine these function outputs without writing a UDF?

Vektor88
  • Can you explain the single_label_value column: how does the 0 value arise in the first row? What exactly are you trying to do with substring here? – Manu Gupta Mar 18 '20 at 12:43
  • @ManuGupta given a column containing an array and a column containing a string, I want to check the position of an element in the array and use it to extract the nth character from the string. – Vektor88 Mar 18 '20 at 12:48
  • Understood. The values you mentioned were not obviously positions, since in the first row the index of 'l3' is 2 and in the second row it is 0, while the values you show are 0 and 1. Let me work on a solution. – Manu Gupta Mar 18 '20 at 12:55
  • Use the SQL syntax of the function; check this FAQ: https://stackoverflow.com/questions/51140470/using-a-column-value-as-a-parameter-to-a-spark-dataframe-function – jxc Mar 18 '20 at 13:56
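
For reference, a minimal sketch of the kind of thing that FAQ suggests, assuming the question's labels and labels_values columns: F.expr uses the SQL form of substring, whose pos argument can be any expression rather than a plain integer.

import pyspark.sql.functions as F

selected_label = 'l3'

# The SQL substring accepts an expression for pos; array_position and
# SQL substring are both 1-based, so the position is used as-is.
my_df = my_df.withColumn(
    "single_label_value",
    F.expr(f"substring(labels_values, array_position(labels, '{selected_label}'), 1)")
)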

2 Answers


Hope this will work for you.

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.getOrCreate()
mydata = [[['l1','l2','l3'],'000'], [['l3','l4','l5'],'100']]

df = spark.createDataFrame(mydata, schema=["labels","labels_values"])

selected_label = 'l3'

# array_position returns a 1-based position, which matches SQL
# substring's 1-based pos, so no offset is needed
df2 = df.select(
        "*",
        array_position(df.labels, selected_label).alias("pos_val"))

df2.createOrReplaceTempView("temp_table")

df3 = spark.sql("select *, substring(labels_values, pos_val, 1) as val_pos from temp_table")

df3.show()


+------------+-------------+-------+-------+
|      labels|labels_values|pos_val|val_pos|
+------------+-------------+-------+-------+
|[l1, l2, l3]|          000|      3|      0|
|[l3, l4, l5]|          100|      1|      1|
+------------+-------------+-------+-------+

array_position gives the 1-based position of the match, which lines up with SQL's 1-based substring, so it can be used directly. Subtract 1 from it only if you want a 0-based index.
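
As a quick check (a throwaway snippet, assuming the same spark session), array_position is indeed 1-based:

# 'l3' is the third element, so this shows a row containing 3
spark.sql("SELECT array_position(array('l1','l2','l3'), 'l3')").show()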

Edit: worked with a temp view; still looking for a solution using the withColumn option. I hope it helps you for now. Edit 2: the same thing using the DataFrame API:

df2 = df.select(
        "*",
        array_position(df.labels, selected_label).astype("int").alias("pos_val")
        )

# expr() lets the SQL substring take pos_val, a column, as its pos argument
df3 = df2.withColumn("asked_col", expr("substring(labels_values, pos_val, 1)"))
df3.show()
Manu Gupta

Try maybe:

import pyspark.sql.functions as f

selected_label = 'l3'

# f.substring requires plain ints for pos and len, but Column.substr
# accepts Column arguments; array_position and substr are both 1-based
df = df.withColumn(
    'single_label_value',
    f.col('labels_values').substr(
        f.array_position(f.col('labels'), f.lit(selected_label)),
        f.lit(1)
    )
)

df.show()

(for Spark version >= 2.4, which introduced array_position)

I think lit() was the function you were missing; you can use it to pass constant values into Spark DataFrame expressions.
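
For example, a toy sketch of lit() adding a constant column:

import pyspark.sql.functions as f

# lit() wraps the plain Python string 'l3' in a Column expression
df = df.withColumn('selected_label', f.lit('l3'))
df.show()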

Grzegorz Skibinski