
What I have -

  1. I have a Spark SQL DataFrame (`df`) that has a column named "Country", among other columns.
  2. I have a list of 24 countries. (Note: this can be converted into a DataFrame or table, whichever is feasible for the solution.) `listcountry = austria,india,russia,france,germany`

What is my problem? I need to create a new column ('remaining countries') in `df` that holds all of the given 24 countries from the list except the one present in `df.Country`.

Expected Solution: (screenshot of the desired output)

Note: I would prefer PySpark/Spark SQL/SQL for this solution.
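The per-row logic being asked for is just a set difference followed by a string join. A minimal plain-Python sketch of that logic, using the five example countries above as a stand-in for the full list of 24:

```python
# Stand-in for the 24-country list from the question.
all_countries = ["austria", "india", "russia", "france", "germany"]

def remaining(country):
    """Every country in the list except the row's own, comma-separated."""
    return ",".join(c for c in all_countries if c != country)

# Simulate two rows of df.Country.
for country in ["india", "france"]:
    print(country, "->", remaining(country))
```

This is only the scalar logic; the Spark versions below apply the same idea column-wise.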

1 Answer


If we combine the results from How to select all records from one table that do not exist in another table? and How do I create a comma-separated list using a SQL query?, the SQL would look something like:

UPDATE df
SET remaining_countries = (
        -- `countries` is the table built from the 24-country list;
        -- keep every entry except the current row's own country
        SELECT GROUP_CONCAT(t1.country SEPARATOR ',')
        FROM countries t1
        WHERE t1.country <> df.Country
    )
WHERE remaining_countries IS NULL

However, since you're working in Spark, I recommend further reading here: Spark SQL replacement for MySQL's GROUP_CONCAT aggregate function

Brandon_R
  • Hi Brandon, 1. I am doing this in an Azure Databricks notebook, where SEPARATOR doesn't seem to work. 2. I have my original table with the columns given above, and another table with one column, country, holding 24 rows of different countries. How does that work here? – Aditi Singhal Feb 23 '22 at 06:53
  • The implication I got from the third link in my answer is that `GROUP_CONCAT` itself wouldn't work, and that there were a few workarounds; I'm not sure which would be best for you. I've also found https://docs.databricks.com/sql/language-manual/functions/concat_ws.html, which may work as `CONCAT_WS(',', t1.country)`. The links you provided were labeled as the current table and the table you wanted at the end, so the code I provided assumed you had already created the column df.remaining_countries and just needed it populated. – Brandon_R Feb 24 '22 at 14:37