2

I have a Spark DataFrame and I want to run `array = np.array(df.collect())` on all of its columns except the first one (which I want to exclude by name or by position). How do I do that?

pault
LN_P
  • Use `drop`: `array = np.array(df.drop("some_column_to_exclude").collect())` or a list comp: `array = np.array(df.select(*[c for c in df.columns if c != "some_column_to_exclude"]).collect())`. Looking for a dupe... – pault Nov 01 '18 at 17:28

2 Answers

1

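I did it this way:

import numpy as np

# Keep every column except 'FAULTY' (note: a set does not preserve column order).
s = list(set(con.columns) - {'FAULTY'})

array = np.array(con.select(s).collect())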
LN_P
  • 1
    This is fine as long as you don't care about maintaining the order of the columns. However, I would recommend `drop` here: `array = np.array(con.drop("FAULTY").collect())`. Or, since it's the first column, you can do `array = np.array(con.select(con.columns[1:]).collect())` – pault Nov 01 '18 at 17:34
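
To illustrate the ordering point in the comment above, here is a minimal sketch that needs no Spark session. The list `columns` is a made-up stand-in for `con.columns`, and the names in it are assumptions chosen only for illustration:

```python
# Hypothetical column names standing in for con.columns.
columns = ["FAULTY", "a", "b", "c", "d"]

# Set difference: the right members, but in no guaranteed order.
unordered = list(set(columns) - {"FAULTY"})

# Order-preserving alternatives.
by_name = [c for c in columns if c != "FAULTY"]  # exclude by name
by_position = columns[1:]                        # exclude the first column

print(sorted(unordered) == sorted(by_name))  # same members either way
print(by_name == by_position)                # both keep the original order
```

Order matters here because, once the rows are in a NumPy array, the column names are gone and position `i` in the array is the only link back to a column.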
0

You can try:

import numpy as np

first_col = 'name_of_your_first_column'
# Compare names with != (using `not in` on a string would do a substring match).
array = np.array(df.select([c for c in df.columns if c != first_col]).collect())
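
One thing to watch when filtering column names against a single string: Python's `in` on a `str` is a substring test, so a filter written as `c not in first_col` also drops any column whose name happens to be contained in `first_col`. A minimal sketch with made-up column names (assumptions for illustration only, no Spark needed):

```python
# Hypothetical column names standing in for df.columns.
columns = ["id", "name", "name_of_your_first_column", "value"]
first_col = "name_of_your_first_column"

# Substring test: "name" is contained in first_col, so it is dropped too.
substring_filter = [c for c in columns if c not in first_col]

# Exact comparison: only the intended column is dropped.
exact_filter = [c for c in columns if c != first_col]

print(substring_filter)
print(exact_filter)
```

If several columns need to be excluded, comparing against a list or set of names (for example `c not in {first_col}`) gives exact matching as well.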
pvy4917