
I have a data frame which looks like this:

Filename
file1.A.txt 
file2.A.txt 
file3.B.txt 
file4.A.txt
file5.B.txt
...

I want to add another column, Type, whose value depends on the filename: if there is an A in the filename, the value should be A; if there is a B, it should be B.

I've seen something vaguely similar in Add column to Data Frame conditionally in Pyspark, but I can't see how to apply it to my case.

I can add a constant column with df = df.withColumn('NewCol', lit('a')), but how can I alter this expression, using a regex, to add one string in some cases and another string in others?

This is similar to the linked question Spark Equivalent of IF Then ELSE, but Michael West's answer is shorter and more specific to my problem. The when/otherwise approach from that question could also solve it, though it would be harder to read.

  • See the linked duplicate, but you essentially want something like `df = df.withColumn("NewCol", when(col("Filename").like("%A%"), lit('a')).when(col("Filename").like("%B%"), lit('b')))` (expanded into a runnable sketch after these comments) – pault Sep 18 '18 at 21:15
  • thanks very much for your answer @pault I am slowly learning Spark. I accepted Michael West's answer because it was shorter and more clearly addressed my narrow scope here, but then when/like is interesting too, I will look into that to learn it better! – con Sep 18 '18 at 21:38
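
For reference, here is pault's suggestion expanded into a self-contained sketch, using the uppercase labels from the question. The .otherwise(None) fallback and the non-matching file6.C.txt row are assumptions, not part of the original comment:

from pyspark.sql.functions import when, col, lit

# file6.C.txt is a hypothetical filename matching neither pattern
df = spark.createDataFrame(
    [("file1.A.txt",), ("file3.B.txt",), ("file6.C.txt",)],
    ["Filename"]
)

# like("%A%") matches an A anywhere in the name, not just the middle token
df = df.withColumn(
    "Type",
    when(col("Filename").like("%A%"), lit("A"))
    .when(col("Filename").like("%B%"), lit("B"))
    .otherwise(lit(None))
)
df.show()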

2 Answers


Something like this should work:

from pyspark.sql.functions import regexp_extract

df = spark.createDataFrame([
    ("file1.A.txt",),
    ("file2.A.txt",),
    ("file3.B.txt",),
    ("file4.A.txt",),
    ("file5.B.txt",)
  ],
  ["filenames"]
)

# capture the token between the first dot and the trailing .txt
df.withColumn('A_or_B', regexp_extract('filenames', r'^\w+\.(.*)\.txt$', 1)).show()

+-----------+------+
|  filenames|A_or_B|
+-----------+------+
|file1.A.txt|     A|
|file2.A.txt|     A|
|file3.B.txt|     B|
|file4.A.txt|     A|
|file5.B.txt|     B|
+-----------+------+
– Michael West
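
Note: regexp_extract returns an empty string rather than null when the pattern does not match, so any filename that doesn't follow the name.X.txt shape would end up with an empty A_or_B value.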

An alternate, less general answer: split the filename into an array on the . delimiter.

from pyspark.sql.functions import split, col

df = spark.createDataFrame([
    ("file1.A.txt",),
    ("file2.A.txt",),
    ("file3.B.txt",),
    ("file4.A.txt",),
    ("file5.B.txt",)
  ],
  ["filenames"]
)

# element [1] of the split result ['fileN', 'A' or 'B', 'txt'] is the letter
df.withColumn('a_or_b', split(col("filenames"), r"\.")[1]).show()

+-----------+------+
|  filenames|a_or_b|
+-----------+------+
|file1.A.txt|     A|
|file2.A.txt|     A|
|file3.B.txt|     B|
|file4.A.txt|     A|
|file5.B.txt|     B|
+-----------+------+
– Michael West
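
As an aside, indexing the array with [1] (Column.getItem) returns null under Spark's default settings when the array is too short. On Spark 2.4+ an equivalent, 1-based spelling uses element_at; this variant is a sketch, not part of the original answer:

from pyspark.sql.functions import element_at, split, col

# element_at is 1-based, so index 2 picks the middle token
df.withColumn('a_or_b', element_at(split(col("filenames"), r"\."), 2)).show()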