
I have the following data frame (called df) with columns item_name and item_level:

   item_name    item_level 
----------------------------
   Item1              1
   Item2              2
   Item3              2
   Item4              3

I would like to create a new column that indents the items depending on their level. To do that, I want to multiply item_level by the string '---', the idea being that the string gets concatenated with itself as many times as the integer value it is multiplied by.
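In plain Python this multiplication behaves exactly as described; a minimal sketch (the values are taken from the example table above):

```python
# Multiplying a string by an integer repeats the string that many times
level = 2
indented = level * "---" + "Item2"
print(indented)  # ------Item2
```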

My desired result is something like this:

  item_name    item_level      new_column
------------------------------------------------
   Item1            1            ---Item1
   Item2            2            ------Item2
   Item3            2            ------Item3
   Item4            3            ---------Item4

In pyspark when I write the following command, the created column contains only null values:

from pyspark.sql import functions as F
df = df.withColumn('new_column',F.concat(F.lit(df.item_level*'---'),df.item_name))

The null values seem to come from the multiplication of the integers with the string. The concat function seems to work properly. For instance, the following works:

df = df.withColumn('new_column',F.concat(df.item_name,df.item_name))

I also tried a few other things. If I use a constant number to multiply the string, the resulting string is displayed as wished:

number = 3
df = df.withColumn('new_column', F.lit(number*'---'))

Furthermore, filling a column with the '---' string and then multiplying that column by the item_level column gives null values as well:

df = df.withColumn('padding',F.lit('---'))
df = df.withColumn('test',df.padding*df.item_level)

If I use pandas, however, this last piece of code does what I want. But I need to do this in pyspark.


1 Answer

There is a function pyspark.sql.functions.repeat that:

Repeats a string column n times, and returns it as a new string column.

Concatenate the result of repeat with the item_name as you were doing in your code. The only wrinkle is that you need to use pyspark.sql.functions.expr in order to pass a column value as an argument to a spark function.

from pyspark.sql.functions import concat, expr

df.withColumn(
    "new_column", 
    concat(expr("repeat('---', item_level)"), "item_name")
).show()
#+---------+----------+--------------+
#|item_name|item_level|    new_column|
#+---------+----------+--------------+
#|    Item1|         1|      ---Item1|
#|    Item2|         2|   ------Item2|
#|    Item3|         2|   ------Item3|
#|    Item4|         3|---------Item4|
#+---------+----------+--------------+

Note that show() right-justifies the displayed output, but the underlying data is as you desired.

  • Thanks so much! This actually does the job! I was struggling so much to find the right way, and this is perfect! – Irena Kuzmanovska Mar 07 '19 at 09:28
  • Instead of a string like '---' I want new_column to be a long string of repeated "item_name", but when I replace "---" with item_name it gives an error. – Garglesoap Nov 25 '20 at 17:42