
I have the following data frame (called df) with columns item_name and item_level:

   item_name    item_level 
----------------------------
   Item1              1
   Item2              2
   Item3              2
   Item4              3

I would like to create a new column that indents the items depending on their level. To do that, I want to multiply item_level by the string '---', the idea being that the string gets concatenated with itself as many times as the integer value it is multiplied by.
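In plain Python this multiplication behaves exactly as described; a minimal sketch (the values are taken from the example table above):

```python
# Multiplying a string by an integer repeats the string that many times
level = 2
indented = level * "---" + "Item2"
print(indented)  # ------Item2
```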

My desired result is something like this:

  item_name    item_level      new_column
------------------------------------------------
   Item1            1            ---Item1
   Item2            2            ------Item2
   Item3            2            ------Item3
   Item4            3            ---------Item4

In pyspark when I write the following command, the created column contains only null values:

from pyspark.sql import functions as F
df = df.withColumn('new_column',F.concat(F.lit(df.item_level*'---'),df.item_name))

The null values seem to come from the multiplication of the integers with the string. The concat function seems to work properly. For instance, the following works:

df = df.withColumn('new_column',F.concat(df.item_name,df.item_name))

I also tried a few other things. If I use a constant number to multiply the string, the resulting string is displayed as wished:

number = 3
df = df.withColumn('new_column', F.lit(number*'---'))

Furthermore, filling a column with the '---' string and then multiplying that column by the item_level column gives null values as well:

df = df.withColumn('padding',F.lit('---'))
df = df.withColumn('test',df.padding*df.item_level)

If I use pandas, however, this last piece of code does what I want. But I need to do this in pyspark.


1 Answer

There is a function pyspark.sql.functions.repeat that:

Repeats a string column n times, and returns it as a new string column.

Concatenate the result of repeat with the item_name as you were doing in your code. The only wrinkle is that you need to use pyspark.sql.functions.expr in order to pass a column value as an argument to a spark function.

from pyspark.sql.functions import concat, expr

df.withColumn(
    "new_column", 
    concat(expr("repeat('---', item_level)"), "item_name")
).show()
#+---------+----------+--------------+
#|item_name|item_level|    new_column|
#+---------+----------+--------------+
#|    Item1|         1|      ---Item1|
#|    Item2|         2|   ------Item2|
#|    Item3|         2|   ------Item3|
#|    Item4|         3|---------Item4|
#+---------+----------+--------------+

Note that show() right-justifies the displayed output, but the underlying data is as you desired.

  • Thanks so much! This actually does the job! I was struggling so much to find the right way, and this is perfect! – Irena Kuzmanovska Mar 07 '19 at 09:28
  • Instead of a string like '---' I want new_column to be a long string of repeated "item_name", but when I replace "---" with item_name it gives an error. – Garglesoap Nov 25 '20 at 17:42