
I want to separate a column of a Spark DataFrame into 2 different columns, based on the first appearance of a character, which in this case is an underscore ("_").

I have prepared a 100% reproducible example:

The mock Spark DataFrame is:

df = spark.createDataFrame(
    [
     (1, 1.8, 'newyork_3434_north'), 
     (4, 2.6, 'la_432432432_south'), 
     (6, 3.3, 'boston_234324_east'), 
     (8, 4.1, 'detroit_6757_west'), 
     (2, 5.7, 'miami_133123_north'), 
     (3, 6.2, 'atlanta_093394_west'), 
     (1, 6.1, 'houston_87342_east')
    ],
    ('ranking', "coordenate", "city")
)

The above code creates a table like the following one:

ranking  coordenate  city
1        1.8         newyork_3434_north
4        2.6         la_432432432_south
6        3.3         boston_234324_east
8        4.1         detroit_6757_west
2        5.7         miami_133123_north
3        6.2         atlanta_093394_west 
1        6.1         houston_87342_east

What I would like to do is separate the column city into 2 different columns, based on the position of the first underscore reading from left to right.

The final desired table would be then something like:

ranking  coordenate  city       code
1        1.8         newyork    3434_north
4        2.6         la         432432432_south
6        3.3         boston     234324_east
8        4.1         detroit    6757_west
2        5.7         miami      133123_north
3        6.2         atlanta    093394_west
1        6.1         houston    87342_east

I have seen several threads about this topic (link_1, link_2, etc.), but they do not deal with the first appearance of a character; they either split on every occurrence of a specific character in the string, or split at a fixed position within the string.

I have also tried the Python Pandas approach but, as expected, it does not carry over to PySpark by extension or analogy (link_3).
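
For reference, the kind of one-liner that does this trivially in Pandas (the column names here are illustrative only) would be roughly:

import pandas as pd

# Plain Pandas sketch: split on the first underscore only and expand into two columns
pdf = pd.DataFrame({"city": ["newyork_3434_north", "la_432432432_south"]})
pdf[["city", "code"]] = pdf["city"].str.split("_", n=1, expand=True)
print(pdf)
#       city             code
# 0  newyork       3434_north
# 1       la  432432432_south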

Thank you in advance for your help.

NuValue
  • how big's your dataframe? do you really need spark for this? – acushner Jul 26 '18 at 13:44
  • Hi @acushner, thanks for your reply. I agree with you, it seems very easy and in Pandas the solution is trivial (I already know how to do it locally), but the dataset is 123 GB. I have adapted a very easy and understandable data set for SO for this purpose, but the real one is huge, even selecting only the column that I want ("city"). Thanks in advance! – NuValue Jul 26 '18 at 13:49
  • use split and concat_ws functions – Ramesh Maharjan Jul 26 '18 at 13:55
  • Are there always exactly 2 underscores? – pault Jul 26 '18 at 13:59
  • Hi @pault, the answer would be no. It is just coincidental in the example I illustrated; I have seen records with more than 2 underscores in my Spark DataFrame. Thank you for your question. – NuValue Jul 26 '18 at 14:01

2 Answers


I think the best option here would be to use pyspark.sql.functions.regexp_extract() and pyspark.sql.functions.regexp_replace():

import pyspark.sql.functions as f

df.select(
    "ranking",
    "coordenate",
    f.regexp_extract("city", pattern="^[A-Za-z]+(?=_)", idx=0).alias('city'),
    f.regexp_replace("city", "^[A-Za-z]+_", "").alias("code")
).show()
#+-------+----------+----------+---------------+
#|ranking|coordenate|      city|           code|
#+-------+----------+----------+---------------+
#|      1|       1.8|   newyork|     3434_north|
#|      4|       2.6|        la|432432432_south|
#|      6|       3.3|    boston|    234324_east|
#|      8|       4.1|   detroit|      6757_west|
#|      2|       5.7|     miami|   133123_north|
#|      3|       6.2|   atlanta|    093394_west|
#|      1|       6.1|   houston|     87342_east|
#+-------+----------+----------+---------------+

In both cases, the pattern is essentially the same:

  • ^[A-Za-z]+: Match one or more letters at the beginning of the string
  • (?=_): Positive lookahead for an underscore

For city we find this pattern and extract the first match. For code we change the lookahead to a match and replace the pattern with an empty string.
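
To see what the two patterns do outside of Spark, here is a quick sanity check with Python's re module on one of the sample values (Spark's regexp functions use Java regex, but both flavours support this lookahead):

import re

s = "newyork_3434_north"

# city: letters from the start of the string, up to (but not including) the first underscore
print(re.search(r"^[A-Za-z]+(?=_)", s).group(0))  # newyork

# code: drop the leading letters and the first underscore
print(re.sub(r"^[A-Za-z]+_", "", s))              # 3434_north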


If finding an appropriate regular expression pattern is difficult, here is an alternative method that will work for Spark Version 2.1 and above:

Getting city is straightforward - you can use pyspark.sql.functions.split() to split the string on the underscore and then use getItem(0) to get the first element of the resulting array.
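
For example, the city part on its own could be obtained with something like this (city_first is just a throwaway column name for illustration):

import pyspark.sql.functions as f

# Split on "_" and keep only the first element of the resulting array
df.withColumn("city_first", f.split("city", "_").getItem(0)).show()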

For the code part, split city on the underscore and use pyspark.sql.functions.posexplode() to explode the resulting array. Then keep only the rows with pos > 0, group by the original columns, and use pyspark.sql.functions.concat_ws() to join the collected tokens back together.

df.select(
        "*",
        f.posexplode(f.split("city", "_")).alias("pos", "token")
    )\
    .where("pos > 0")\
    .groupBy("ranking", "coordenate", "city")\
    .agg(f.concat_ws("_", f.collect_list("token")).alias("code"))\
    .select(
        "ranking",
        "coordenate",
        f.split("city", "_").getItem(0).alias("city"),
        "code"
    )\
    .show()
pault

@pault has already given an awesome answer using regex and the split and concat_ws built-in functions.

Here's a simple alternative using a udf function:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

@udf(ArrayType(StringType()))
def splitUdf(x):
    # split on every underscore, keep the first token as city and rejoin the rest as code
    splitted = x.split('_')
    return [splitted[0], '_'.join(splitted[1:])]

df.withColumn('city', splitUdf(col('city')))\
    .select(col('ranking'), col('coordenate'), col('city')[0].alias('city'), col('city')[1].alias('code'))\
    .show()

which should give you

+-------+----------+-------+---------------+
|ranking|coordenate|   city|           code|
+-------+----------+-------+---------------+
|      1|       1.8|newyork|     3434_north|
|      4|       2.6|     la|432432432_south|
|      6|       3.3| boston|    234324_east|
|      8|       4.1|detroit|      6757_west|
|      2|       5.7|  miami|   133123_north|
|      3|       6.2|atlanta|    093394_west|
|      1|       6.1|houston|     87342_east|
+-------+----------+-------+---------------+

I hope the answer is helpful.

Ramesh Maharjan