
I have a column full of arrays containing split HTTP requests. I have them filtered down to one of two possibilities:

|[, courses, 27381...|
|[, courses, 27547...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, courses, 33287...|
|[, courses, 24024...|

In both array types, everything from 'courses' onward has the same data and structure.

I want to take a slice of the array using a case statement: if the first element of the array is 'api', take elements 3 -> end of the array; if it's not 'api', just keep the given value. I've tried Python slice syntax [3:] and normal PostgreSQL syntax [3, n], where n is the length of the array.

My ideal end result would be an array where every row shares the same structure, with 'courses' in the first index for easier parsing from that point onwards.
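Schematically (the truncated values above are just left as placeholders here), the transformation I'm after is:

['', 'api', 'v1', 'courses', ...]  ->  ['courses', ...]
['', 'courses', ...]               ->  left as the given value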

flybonzai
  • the given value == whole array? – Alberto Bonsanto Jun 09 '16 at 21:50
  • given value would just be the row. So for example, if the array is `['courses', 'etc...', ...]` then leave the value as that, but if it were `['api', 'v1', 'courses', 'something...']` then set the value to `['courses', 'something', ...]` – flybonzai Jun 09 '16 at 21:53
  • I think a udf would come in handy here. see https://ragrawal.wordpress.com/2015/10/02/spark-custom-udf-example/ – Saif Charaniya Jun 09 '16 at 22:08

2 Answers


It's very easy: just define a UDF. You asked a very similar question previously, so I won't post the exact answer, to let you think and learn (for your own good).

from pyspark.sql.functions import udf, col, lit

df = sc.parallelize([(["ab", "bs", "xd"],), (["bc", "cd", ":x"],)]).toDF()

# Drop the first element whenever the element at position y equals "ab"
getUDF = udf(lambda x, y: x[1:] if x[y] == "ab" else x)

df.select(getUDF(col("_1"), lit(0))).show()

+------------------------+
|PythonUDF#<lambda>(_1,0)|
+------------------------+
|                [bs, xd]|
|            [bc, cd, :x]|
+------------------------+
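For the arrays in the question specifically, a rough sketch of the same pattern with an explicit return type could look like the following (assuming, as the sample rows suggest, a leading empty string so the arrays look like ['', 'api', 'v1', 'courses', ...]; split_request is just a placeholder column name):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

# Drop the leading ['', 'api', 'v1'] prefix when present, otherwise keep the row as given
# (assumes a leading empty string produced by the split)
normalize = udf(lambda xs: xs[3:] if len(xs) > 1 and xs[1] == "api" else xs,
                ArrayType(StringType()))

df.select(normalize(col("split_request")).alias("cleaned")).show()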
Alberto Bonsanto
  • So if I pass in a literal lower and upper bound into the `UDF`, then I can just return it as such `array[lbound:ubound]` and set my `UDF` return value to `ArrayType(StringType())`? – flybonzai Jun 09 '16 at 22:51
  • Type is inferred by the `UDF`, but yes for the rest. – Alberto Bonsanto Jun 09 '16 at 22:52
  • sometimes you want to avoid UDF for optimizing your code. [Here](http://stackoverflow.com/questions/40134975/selecting-a-range-of-elements-in-an-array-spark-sql/40147113#40147113) is an example in Scala. But the logic here should be very similar by defining a `slice` function `(Int,Int) => Column` and using `when` and `otherwise` from the `Column` API. – Wilmerton Oct 20 '16 at 07:26

Assuming that the column in your DataFrame is called http_col and that the first item in each array is an empty string, a possible solution is:

df.selectExpr(
  """if(array_contains(http_col, 'api'),
        slice(http_col, 4, size(http_col) - 3),
        http_col) as cleaned_http_col
  """
)
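For reference, a small reproduction of this (a sketch only; the arrays are made up to mirror the two shapes in the question, and slice itself also needs Spark >= 2.4.0 as far as I know):

df = spark.createDataFrame(
    [(["", "api", "v1", "courses", "27381"],),
     (["", "courses", "33287"],)],
    ["http_col"]
)

df.selectExpr(
    """if(array_contains(http_col, 'api'),
          slice(http_col, 4, size(http_col) - 3),
          http_col) as cleaned_http_col
    """
).show(truncate=False)

# First row becomes ['courses', '27381']; the second is kept as ['', 'courses', '33287']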

If you have Spark >= 2.4.0, another option could be:

df.selectExpr(
  "array_remove(array_remove(http_col, 'api'), 'v1') as cleaned_http_col"
)
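One caveat to keep in mind: array_remove strips every occurrence of 'api' and 'v1', not just the leading ones, so this variant relies on those strings never appearing later in the path; the slice-based expression above does not make that assumption.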
datapug