I have a Spark DataFrame and I want to run `array = np.array(df.collect())`
on all of my columns except the first one (which I want to exclude by name or by position). How do I do that?
Use `drop`: `array = np.array(df.drop("some_column_to_exclude").collect())` or a list comp: `array = np.array(df.select(*[c for c in df.columns if c != "some_column_to_exclude"]).collect())`. Looking for a dupe... – pault Nov 01 '18 at 17:28
2 Answers
1
I did it this way:
s = list(set(con.columns) - {'FAULTY'})
array = np.array(con.select(s).collect())

LN_P
This is fine as long as you don't care about maintaining the order of the columns. However, using `drop` here would be my recommendation: `array = np.array(con.drop("FAULTY").collect())`. Or, since it's the first column, you can do `array = np.array(con.select(con.columns[1:]).collect())` – pault Nov 01 '18 at 17:34
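To make the ordering point concrete, here is a minimal sketch that uses a plain Python list as a stand-in for `con.columns`. Only `FAULTY` comes from the answer; the other column names are made up for illustration. Any of the resulting lists could then be passed to `con.select(...)`.

```python
# Stand-in for con.columns; only 'FAULTY' comes from the answer above,
# the other names are hypothetical.
columns = ["FAULTY", "TEMP", "PRESSURE", "SPEED"]

# Set difference: the right columns, but iteration order is not guaranteed.
unordered = list(set(columns) - {"FAULTY"})

# List comprehension: excludes the column by name and keeps the original order.
by_name = [c for c in columns if c != "FAULTY"]

# Slicing: excludes the first column by position and keeps the original order.
by_position = columns[1:]

print(by_name)
print(by_position)
```

If the column order of the resulting NumPy array matters, the last two forms are the safer choice.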
0
You can try:
first_col = 'name_of_your_first_column'
df_exclude = df.select([c for c in df.columns if c != first_col]).collect()
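When filtering column names against a single string, note that `in` and `not in` perform a substring test on strings, so an exact comparison (`!=`) or a list of names is the safer choice. A small sketch with made-up column names, using a plain list in place of `df.columns`:

```python
# Hypothetical names for illustration; 'columns' stands in for df.columns.
first_col = "user_id"
columns = ["user_id", "user", "id", "score"]

# Substring test: also drops 'user' and 'id', because both occur inside 'user_id'.
substring_filter = [c for c in columns if c not in first_col]

# Equality test: drops only the column whose name matches exactly.
equality_filter = [c for c in columns if c != first_col]

print(substring_filter)
print(equality_filter)
```

To exclude several columns at once, comparing against a list (for example `c not in ["user_id", "score"]`) gives exact matching, because membership in a list checks whole elements.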

pvy4917