
I have tried the below in pandas and it works, but I'm wondering how I might do it in PySpark.

The input is

news.bbc.co.uk

It should be split at the '.' so that index equals:

[['news', 'bbc', 'co', 'uk'], ['next', 'domain', 'name']]

index = df2.domain.str.split('.').tolist() 
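
For reference, a runnable version of the pandas approach (assuming a two-row df2, with the second row made up to match the expected output above):

import pandas as pd

df2 = pd.DataFrame({'domain': ['news.bbc.co.uk', 'next.domain.name']})
index = df2.domain.str.split('.').tolist()
print(index)  # [['news', 'bbc', 'co', 'uk'], ['next', 'domain', 'name']]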

Does anyone know how I'd do this in Spark rather than pandas?

Thanks

  • Possible duplicate of [Split Contents of String column in PySpark Dataframe](https://stackoverflow.com/questions/41283478/split-contents-of-string-column-in-pyspark-dataframe) and [Splitting a column in pyspark](https://stackoverflow.com/questions/48790246/splitting-a-column-in-pyspark) and [Pyspark Split Columns](https://stackoverflow.com/questions/46835882/pyspark-split-columns?rq=1) – pault Oct 24 '18 at 14:20

3 Answers


Using a plain '.' doesn't behave as expected, because split treats the pattern as a regular expression and a bare dot matches any character. Escaping it as '\.' actually works:

import pyspark.sql.functions as F
df = df.withColumn('col_name', F.split(F.col('col_name'), r'\.'))
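
For example, a minimal sketch against the sample input from the question (this assumes an active SparkSession named spark; domain and domain_parts are just illustrative column names):

import pyspark.sql.functions as F

df = spark.createDataFrame([('news.bbc.co.uk',), ('next.domain.name',)], ['domain'])
df = df.withColumn('domain_parts', F.split(F.col('domain'), r'\.'))
df.first()['domain_parts']  # ['news', 'bbc', 'co', 'uk']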

You can use pyspark.sql.functions.split to split the string column. Note that the pattern is interpreted as a regular expression, so the dot has to be escaped:

import pyspark.sql.functions as F

df = df.withColumn('col_name', F.split(F.col('col_name'), r'\.'))
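
If you also want the nested Python list that the pandas .tolist() in the question produces, a sketch of one way is to collect the split column back to the driver (fine for small data only):

# equivalent of df2.domain.str.split('.').tolist() in pandas
index = [row['col_name'] for row in df.select('col_name').collect()]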
mayank agrawal
from pyspark.sql.functions import split
df.select(split("col_name", r'[\.]'))

or

df.selectExpr(r"split(col_name, '[\.]')")
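
Both forms return the array under an auto-generated column name; a small sketch of giving it an explicit name (parts is just an illustrative alias):

from pyspark.sql.functions import split

df.select(split('col_name', r'[\.]').alias('parts'))
# or, with the expression syntax:
df.selectExpr(r"split(col_name, '[\.]') AS parts")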
Akshat Chaturvedi