24

I have a dataframe in Spark 1.6 and want to select just some columns out of it. The column names are like:

colA, colB, colC, colD, colE, colF-0, colF-1, colF-2

I know I can do like this to select specific columns:

df.select("colA", "colB", "colE")

but how to select, say "colA", "colB" and all the colF-* columns at once? Is there a way like in Pandas?

Community
  • 1
  • 1
user299791
  • 2,021
  • 3
  • 31
  • 57

4 Answers4

33

The process canbe broken down into following steps:

  1. First grab the column names with df.columns,
  2. then filter down to just the column names you want .filter(_.startsWith("colF")). This gives you an array of Strings.
  3. But the select takes select(String, String*). Luckily select for columns is select(Column*), so finally convert the Strings into Columns with .map(df(_)),
  4. and finally turn the Array of Columns into a var arg with : _*.

df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show

This filter could be made more complex (same as Pandas). It is however a rather ugly solution (IMO):

df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show 

If the list of other columns is fixed you could also merge a fixed array of columns names with filtered array.

df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show
Vivek Puurkayastha
  • 466
  • 1
  • 9
  • 18
Michael Lloyd Lee mlk
  • 14,561
  • 3
  • 44
  • 81
13

Python (tested in Azure Databricks)

selected_columns = [column for column in df.columns if column.startswith("colF")]
df2 = df.select(selected_columns)
Eugene Lycenok
  • 603
  • 6
  • 14
1

In PySpark, use: colRegex to select columns starting with colF Whit the sample:

colA, colB, colC, colD, colE, colF-0, colF-1, colF-2

Apply:

df.select(col("colA"), col("colB"), df.colRegex("`(colF)+?.+`")).show()

The result is:

colA, colB, colF-0, colF-1, colF-2
-1

I wrote a function that does that. Read the comments to see how it works.

  /**
    * Given a sequence of prefixes, select suitable columns from [[DataFrame]]
    * @param columnPrefixes Sequence of prefixes
    * @param dF Incoming [[DataFrame]]
    * @return [[DataFrame]] with prefixed columns selected
    */
  def selectPrefixedColumns(columnPrefixes: Seq[String], dF: DataFrame): DataFrame = {
    // Find out if given column name matches any of the provided prefixes
    def colNameStartsWith: String => Boolean = (colName: String) =>
        columnsPrefix.map(prefix => colName.startsWith(prefix)).reduce(_ || _)
    // Filter columns list by checking against given prefixes sequence
    val columns = dF.columns.filter(colNameStartsWith)
    // Select filtered columns list
    dF.select(columns.head, columns.tail:_*)
  }
kfkhalili
  • 996
  • 1
  • 11
  • 24