For example, the data is
customer = spark.createDataFrame([
(0, "Bill Chambers"),
(1, "Matei Zaharia"),
(2, "Michael Armbrust")])\
.toDF("customerid", "name")
order = spark.createDataFrame([
(0, 0, "Product 0"),
(1, 1, "Product 1"),
(2, 1, "Product 2"),
(3, 3, "Product 3"),
(4, 1, "Product 4")])\
.toDF("orderid", "customerid", "product_name")
to get customer with order, I can do it with left semi
customer.join(order, ['customerid'], "left_semi").show()
It can return
Now for comparison reason, I want to add a flag column instead of directly filtering out some rows. The desired output would look like this:
+----------+----------------+---------+
|customerid| name|has_order|
+----------+----------------+---------+
| 0| Bill Chambers | true|
| 1| Matei Zaharia | true|
| 2|Michael Armbrust| false|
+----------+----------------+---------+
How can I do it? Is there any elegant way to do it? I tried to search but didn't find things related, maybe I get the wrong key words?
Possible to do it with SQL's exist/in?: Spark replacement for EXISTS and IN