I have a DataFrame in PySpark like this:

+--------------------+--------+----------+--------------------+--------------------+
|               title| journal|      date|              author|             content|
+--------------------+--------+----------+--------------------+--------------------+
|Kudlow Breaks Wit...|NYT     |2019-05-01|    By Mark Landler |WASHINGTON — Pres...|
|Scrutiny of Russi...|NYT     |2019-05-01|By Charlie Savage...|WASHINGTON — The ...|
|Greek Anarchists ...|NYP     |2019-05-01|By Niki Kitsantonis |ATHENS — Greek an...|
+--------------------+--------+----------+--------------------+--------------------+

I want to replace the journal value in the rows where journal equals "NYT". I know how to do this with the SQL context:

df.createOrReplaceTempView("tbl_journal")
df = sqlContext.sql("SELECT journal, date FROM tbl_journal where journal like '%NYT%'")
df = df.withColumn('journal', lit('The New York Times'))

The problem is that this overwrites the original DataFrame (I just want to replace the values where journal = 'NYT' and keep the other values).

I also searched other topics, but I did not find a way to combine a where and a withColumn statement. For example, if I do this in PySpark (not with SQL):

df.where(col('journal').like("%NYT%")).withColumn('journal', lit('Oui Test')).show()

It replaces all the values; there is no condition on the replacement.

Do you know how to replace only the values matching this condition in the original DataFrame, with Spark or sqlContext? Thanks in advance!

Mouss1995

1 Answer

Use when/otherwise to populate values conditionally:

from pyspark.sql.functions import when
df = df.withColumn('journal', when(df.journal.like('%NYT%'), 'The New York Times').otherwise(df.journal))
SunilG
  • Thanks a lot for your help. I get an invalid syntax error at the "like" statement. I imported all the SQL functions from pyspark. Is there something else I need to import from pyspark in order to use this statement? – Mouss1995 Jun 04 '21 at 09:00
  • Try the edited answer – SunilG Jun 04 '21 at 09:05
  • OK, I have another error: "TypeError: condition should be a Column". I think it's about the expression "journal like '%NYT%'", no? – Mouss1995 Jun 04 '21 at 09:15
  • Try the updated one, it works for me. – SunilG Jun 04 '21 at 09:28
  • Thanks a lot, it works perfectly! I didn't know about when/otherwise. Have a nice day :) – Mouss1995 Jun 04 '21 at 09:46