
I'm having trouble dropping a Spark DataFrame column whose name contains a period. I know that you need to escape such column names with backticks (`). This works when I select columns, and I've written a little static helper to escape every column name that needs it:

@staticmethod
def escape(columns):
    # Wrap any column name containing a period in backticks
    return ["`" + col + "`" if "." in col else col for col in columns]
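
For example, given the column names used in the reproducible example further down (MySparkClass being the class this method lives on):

print(MySparkClass.escape(['index', 'X', 'label.X.L.', 'date']))
# ['index', 'X', '`label.X.L.`', 'date']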

This can then be used to get my desired list of columns to select:

import re

desired_columns = MySparkClass.escape(
    list(filter(lambda col: re.search('targetRegexStuffHere', col), target_df.columns))
)

filtered_df = target_df.select(desired_columns)

Using a trivial, reproducible example:

same = sqlContext.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label.X.L.', 'date')
)

print(same.select('`label.X.L.`').collect())

Output here is:

[Row(label.X.L.='A'), Row(label.X.L.='B'), Row(label.X.L.='A'), Row(label.X.L.='B')]

However, removing the backticks results in an AnalysisException:

pyspark.sql.utils.AnalysisException: 'syntax error in attribute name: label.X.L.;'

When I attempt to drop the label.X.L. column, however, the backticks appear to make no difference:

print(same.drop('`label.X.L.`').collect())

The output is:

[Row(index=1, X=1, label.X.L.='A', date='2017-01-01'),
 Row(index=2, X=3, label.X.L.='B', date='2017-01-02'),
 Row(index=3, X=5, label.X.L.='A', date='2017-01-03'),
 Row(index=4, X=7, label.X.L.='B', date='2017-01-04')]

What is the proper way to drop a column that contains a period within its name?

  • Can you add an [mcve]? I'm unable to reproduce your issue. Please read more on [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Jul 09 '18 at 20:21
  • As an aside, another way to drop columns is by selecting all of the other columns: `df = df.select(*[c for c in df.columns if c != "favorite_country_U.S.A"])` – pault Jul 09 '18 at 20:24
  • @pault I have added an example – Yu Chen Jul 09 '18 at 20:44
  • Interesting - the `drop()` seems to work for me without the backticks: `print(same.drop('label.X.L.').collect())` - what version of spark? – pault Jul 09 '18 at 21:08
  • @pault Um, you were right. I feel like an idiot. If I just do `print(same.drop('label.X.L.').collect())` I get the desired output. So it looks like `drop()` doesn't require backticks, but `select()` does? – Yu Chen Jul 09 '18 at 21:23

1 Answer


The syntax for specifying columns is slightly different between select() and drop(). When a column name contains a period, select() requires backticks:

same.select('`label.X.L.`') # note the backticks

However, when dropping the same column, you must omit the backticks:

same.drop('label.X.L.') # note the absence of the backticks
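
A minimal sketch contrasting the two, using the example DataFrame from the question (behavior as observed on Spark 2.x; drop() silently ignores a name that matches no column, which is why the backticked version appeared to do nothing):

# select() treats dots in a string column name as field separators,
# so the dotted name must be escaped with backticks
same.select('`label.X.L.`').show()

# drop() matches its string argument against column names literally;
# with backticks, nothing matches and the call is a silent no-op
print(same.drop('label.X.L.').columns)    # ['index', 'X', 'date']
print(same.drop('`label.X.L.`').columns)  # all four columns, unchanged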
Yu Chen