
I have a column full of arrays containing split HTTP requests. I have them filtered down to one of two possibilities:

|[, courses, 27381...|
|[, courses, 27547...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, courses, 33287...|
|[, courses, 24024...|

In both array types, everything from 'courses' onward has the same data and structure.

I want to take a slice of the array using a case statement: if the first element of the array is 'api', take elements 3 -> end of the array; if it's not 'api', just keep the given value. I've tried Python slice syntax [3:] and normal PostgreSQL syntax [3, n], where n is the length of the array.

My ideal end result would be an array where every row shares the same structure, with 'courses' in the first index for easier parsing from that point onwards.
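Schematically (the truncated values above are just left as placeholders here), the transformation I'm after is:

['', 'api', 'v1', 'courses', ...]  ->  ['courses', ...]
['', 'courses', ...]               ->  left as the given value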

flybonzai
  • the given value == whole array? – Alberto Bonsanto Jun 09 '16 at 21:50
  • given value would just be the row. So for example, if the array is `['courses', 'etc...', ...]` then leave the value as that, but if it were `['api', 'v1', 'courses', 'something...']` then set the value to `['courses', 'something', ...]` – flybonzai Jun 09 '16 at 21:53
  • I think a udf would come in handy here. see https://ragrawal.wordpress.com/2015/10/02/spark-custom-udf-example/ – Saif Charaniya Jun 09 '16 at 22:08

2 Answers


It's very easy: just define a UDF. You asked a very similar question previously, so I won't post the exact answer, to let you think and learn (for your own good).

from pyspark.sql.functions import udf, col, lit

df = sc.parallelize([(["ab", "bs", "xd"],), (["bc", "cd", ":x"],)]).toDF()

# Drop the first element whenever the element at position y equals "ab"
getUDF = udf(lambda x, y: x[1:] if x[y] == "ab" else x)

df.select(getUDF(col("_1"), lit(0))).show()

+------------------------+
|PythonUDF#<lambda>(_1,0)|
+------------------------+
|                [bs, xd]|
|            [bc, cd, :x]|
+------------------------+
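For the arrays in the question specifically, a rough sketch of the same pattern with an explicit return type could look like the following (assuming, as the sample rows suggest, a leading empty string so the arrays look like ['', 'api', 'v1', 'courses', ...]; split_request is just a placeholder column name):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

# Drop the leading ['', 'api', 'v1'] prefix when present, otherwise keep the row as given
# (assumes a leading empty string produced by the split)
normalize = udf(lambda xs: xs[3:] if len(xs) > 1 and xs[1] == "api" else xs,
                ArrayType(StringType()))

df.select(normalize(col("split_request")).alias("cleaned")).show()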
Alberto Bonsanto
  • So if I pass in a literal lower and upper bound into the `UDF`, then I can just return it as such `array[lbound:ubound]` and set my `UDF` return value to `ArrayType(StringType())`? – flybonzai Jun 09 '16 at 22:51
  • Type is inferred by the `UDF`, but yes for the rest. – Alberto Bonsanto Jun 09 '16 at 22:52
  • sometimes you want to avoid UDF for optimizing your code. [Here](http://stackoverflow.com/questions/40134975/selecting-a-range-of-elements-in-an-array-spark-sql/40147113#40147113) is an example in Scala. But the logic here should be very similar by defining a `slice` function `(Int,Int) => Column` and using `when` and `otherwise` from the `Column` API. – Wilmerton Oct 20 '16 at 07:26

Assuming that the column in your DataFrame is called http_col and that the first item in each array is an empty string, a possible solution is:

df.selectExpr(
  """if(array_contains(http_col, 'api'),
        slice(http_col, 4, size(http_col) - 3),
        http_col) as cleaned_http_col
  """
)
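For reference, a small reproduction of this (a sketch only; the arrays are made up to mirror the two shapes in the question, and slice itself also needs Spark >= 2.4.0 as far as I know):

df = spark.createDataFrame(
    [(["", "api", "v1", "courses", "27381"],),
     (["", "courses", "33287"],)],
    ["http_col"]
)

df.selectExpr(
    """if(array_contains(http_col, 'api'),
          slice(http_col, 4, size(http_col) - 3),
          http_col) as cleaned_http_col
    """
).show(truncate=False)

# First row becomes ['courses', '27381']; the second is kept as ['', 'courses', '33287']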

If you have Spark >= 2.4.0, another option could be:

df.selectExpr(
  "array_remove(array_remove(http_col, 'api'), 'v1') as cleaned_http_col"
)
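One caveat to keep in mind: array_remove strips every occurrence of 'api' and 'v1', not just the leading ones, so this variant relies on those strings never appearing later in the path; the slice-based expression above does not make that assumption.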
datapug