I'm using Spark 2.3.

I have a DataFrame like this (in other situations _c0 may contain 20 inner fields):

_c0                     | _c1
-----------------------------
1.1   1.2          4.55 | a
4.44  3.1          9.99 | b
1.2   99.88        10.1 | x

I want to split _c0 and create a new DataFrame like this:

col1 |col2  |col3 |col4
-----------------------------
1.1  |1.2   |4.55 | a
4.44 |3.1   |9.99 | b
1.2  |99.88 |10.1 | x

I know how to solve this using getItem():

import re

# split _c0 on runs of spaces; keep _c1 unchanged
df = originalDf.rdd.map(lambda x: (re.split(" +", x[0]), x[1])).toDF()
# now df[0] is an array of strings and df[1] is a string
df = df.select(df[0].getItem(0), df[0].getItem(1), df[0].getItem(2), df[1])

But I was hoping to find a different way to solve this, because _c0 may contain more than 3 inner columns.

Is there a way to use flatMap to generate the df?

Is there a way to insert df[1] as an inner field of df[0]?

Is there a way to use df[0].getItem(), so it returns all inner fields?

Is there a simpler way to generate the data-frame?

Any help will be appreciated

Thanks

Nir
  • pls share the structure of your dataframe – thebluephantom Nov 25 '18 at 11:35
  • Possible duplicate of [Split Spark Dataframe string column into multiple columns](https://stackoverflow.com/questions/39235704/split-spark-dataframe-string-column-into-multiple-columns) – pault Nov 26 '18 at 15:47
  • pault, I hoped to find a simple way to do it without using getItem() because I have many inner fields – Nir Nov 27 '18 at 07:22

1 Answer

Use the split function from pyspark.sql.functions with a regex pattern that matches runs of whitespace ("\\s+"). Its source, from the docs: https://spark.apache.org/docs/2.3.1/api/python/_modules/pyspark/sql/functions.html

def split(str, pattern):
    """
    Splits str around pattern (pattern is a regular expression).

    .. note:: pattern is a string represent the regular expression.

    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    [Row(s=[u'ab', u'cd'])]
    """
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.split(_to_java_column(str), pattern))

Then you can use getItem on the resulting array column to get the value of a particular field.
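
Since _c0 may hold many inner fields, the getItem calls do not have to be written out by hand; they can be generated in a list comprehension. Below is a minimal sketch of that idea, not a definitive implementation. It assumes the number of fields is known up front and that the fields are separated by whitespace. The helper names field_names and split_to_columns are hypothetical, chosen only for this illustration.

```python
def field_names(n_fields, prefix="col"):
    """Return ['col1', ..., 'colN'] for the generated columns (hypothetical helper)."""
    return ["{}{}".format(prefix, i + 1) for i in range(n_fields)]


def split_to_columns(df, src="_c0", n_fields=3, pattern=r"\s+"):
    """Split the string column `src` into `n_fields` columns and keep the rest.

    Assumption: every row of `src` holds at least `n_fields` whitespace-separated
    values; getItem yields null for a missing index.
    """
    # Imported inside the function so field_names can be used without Spark installed.
    from pyspark.sql import functions as F

    # trim first so leading/trailing spaces do not produce empty array entries
    parts = F.split(F.trim(F.col(src)), pattern)
    others = [c for c in df.columns if c != src]
    names = field_names(n_fields + len(others))

    # one getItem(i) per inner field, generated instead of typed by hand
    split_cols = [parts.getItem(i).alias(names[i]) for i in range(n_fields)]
    # remaining columns (e.g. _c1) continue the col<N> numbering
    kept_cols = [F.col(c).alias(names[n_fields + j]) for j, c in enumerate(others)]
    return df.select(*(split_cols + kept_cols))
```

With the DataFrame from the question, split_to_columns(originalDf, "_c0", 3) should give the columns col1, col2, col3 and col4. If the field count is not known in advance, one option is to read it from the first row with F.size applied to the split array, assuming all rows have the same number of fields.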

morsik