
I imported the data from a MongoDB database:

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri",
"mongodb://127.0.0.1/test.db").load()

I have selected the double columns using

double_list = [name for name,types in df.dtypes if types == 'double']
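In PySpark, df.dtypes returns a list of (column_name, type_string) tuples, so the filter above is plain Python and can be checked without a Spark session. The column names below are illustrative, not from the actual database:

```python
# df.dtypes yields (column_name, type_string) pairs; simulated here
# with made-up names so the filter can run without Spark.
dtypes = [("id", "string"), ("price", "double"), ("score", "double")]

# keep only the names whose type string is 'double'
double_list = [name for name, types in dtypes if types == "double"]
print(double_list)  # ['price', 'score']
```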

Credits to @Ramesh Maharjan. To remove special characters we use

removedSpecials = [''.join(y for y in x if y.isalnum()) for x in double_list]
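Since str.isalnum() keeps only letters and digits, this strips dots, spaces, underscores, and any other punctuation from the names. A quick check with made-up column names:

```python
# str.isalnum() is True only for letters and digits, so '.', ' ' and '_'
# are all dropped; these names are illustrative.
double_list = ["price.usd", "avg score", "net_total"]
removedSpecials = ["".join(y for y in x if y.isalnum()) for x in double_list]
print(removedSpecials)  # ['priceusd', 'avgscore', 'nettotal']
```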

The question is:

How can I create a new dataframe, based on df, with ONLY the double_list columns?

PolarBear10
  • @RameshMaharjan the other one is to just get the datatype of a column, this question is 'creating a new dataframe in pyspark based on the datatype of the column' – PolarBear10 Jul 06 '18 at 09:49
  • @RameshMaharjan no, I am afraid not. My question is how to remove special characters from column type double. I already know how to print the schema and find the double columns. The question is how to create a new dataframe, based on the columns double. – PolarBear10 Jul 06 '18 at 10:34
  • @RameshMaharjan No worries, I was not clear enough. I slightly modified the question. – PolarBear10 Jul 06 '18 at 11:15
  • so you just want to rename the columns that have double as datatype, with the special characters removed. Is that correct? – Ramesh Maharjan Jul 06 '18 at 11:19
  • @RameshMaharjan yes, you are on point ! – PolarBear10 Jul 06 '18 at 11:20

2 Answers


If you already have the list of column names with double as datatype, then the next step is to remove the special characters, which can be done using .isalnum() as in the question:

removedSpecials = [''.join(y for y in x if y.isalnum()) for x in double_list]

Once you have the list of column names with the special characters removed, it's just a .withColumnRenamed() API call:

for (x, y) in zip(double_list, removedSpecials):
    df = df.withColumnRenamed(x, y)

df.show(truncate=False) should give you the dataframe with the double-typed columns renamed.

If you don't want the columns that are not in double_list, i.e. the columns that are not of double datatype, then you can use select:

df.select(*removedSpecials).show(truncate=False)
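The rename-then-select flow above can be simulated in plain Python to see what happens to the names; a dict stands in for the DataFrame's columns, and the names and values are made up:

```python
# Pure-Python sketch of the rename-then-select flow (no Spark needed).
# The dict stands in for a DataFrame's columns; names are illustrative.
columns = {"price.usd": [1.0], "qty#sold": [2.0], "label": ["a"]}
double_list = ["price.usd", "qty#sold"]  # the double-typed columns

# clean the names, as in the answer
removedSpecials = ["".join(c for c in n if c.isalnum()) for n in double_list]

# "withColumnRenamed": map each old name to its cleaned name
rename_map = dict(zip(double_list, removedSpecials))
renamed = {rename_map.get(k, k): v for k, v in columns.items()}

# "select": keep only the cleaned double columns, dropping the rest
selected = {k: renamed[k] for k in removedSpecials}
print(selected)  # {'priceusd': [1.0], 'qtysold': [2.0]}
```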

The reason for renaming with

for (x, y) in zip(double_list, removedSpecials):
    df = df.withColumnRenamed(x, y)

before doing

df.select(*removedSpecials).show(truncate=False)

is that the original column names might contain special characters like ., which Spark interprets as a struct-field accessor unless the name is escaped with backticks. That prevents concise solutions like df.select([df[x].alias(y) for (x, y) in zip(double_list, removedSpecials)]).show(truncate=False) from working.

I hope the answer is helpful

Ramesh Maharjan
  • `df = df.withColumnRenamed(x,y)` will now also have columns that were NOT in the `double_list`. The idea is to create a new dataFrame with only the `double_list` columns. All other columns that are not in `double_list` are dropped. – PolarBear10 Jul 06 '18 at 11:32
  • so you don't want the columns that are not in double_list? – Ramesh Maharjan Jul 06 '18 at 11:32
  • I do not want the columns that are not in `double_list` – PolarBear10 Jul 06 '18 at 11:34
  • I am trying to pass `df.select(*removedSpecials)` to a new variable, but simply doing `df_new = df.select(*removedSpecials)` does not seem to do the trick. Is there a way to create a new independent copy variable from the original dataframe with only those columns? – PolarBear10 Jul 06 '18 at 13:25
  • actually, I misunderstood the error. Here it is **Caused by: org.apache.spark.SparkException: Values to assemble cannot be null** . The next step will be to drop all empty and blank columns from the `df.select(*removedSpecials)`. Phew ! gotta figure that out. – PolarBear10 Jul 06 '18 at 13:47
  • try it. If you can't, research on the web; you can find a lot of info on Stack Overflow itself. If all of that fails, you are always welcome to ask another question – Ramesh Maharjan Jul 06 '18 at 13:50

Scala code, which you can convert into Python:

import sqlContext.implicits._
// sample df
df.show()
+----+--------------------+--------+
|data|                Week|NumCCol1|
+----+--------------------+--------+
| aac|01/28/2018-02/03/...|     2.0|
| aac|02/04/2018-02/10/...|    23.0|
| aac|02/11/2018-02/17/...|   105.0|
+----+--------------------+--------+

df.printSchema()

root
 |-- data: string (nullable = true)
 |-- Week: string (nullable = true)
 |-- NumCCol1: double (nullable = false)

 val doubleCols = df.schema.fields
      .collect({ case x if x.dataType.typeName == "double" => x.name })

 // select via the String varargs overload so it works for any
 // number of double columns (a foldLeft of single-column selects
 // would fail as soon as there is more than one)
 val df1 = df.select(doubleCols.head, doubleCols.tail: _*)

 // df with only double columns
 df1.show()

Use df1.withColumnRenamed to rename the columns.
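A possible Python conversion of the Scala snippet above; the schema filter is plain Python (shown on a mock of df.dtypes so it runs without Spark), while the Spark-only calls are left as comments:

```python
# Hypothetical PySpark conversion of the Scala answer.
# df.dtypes is simulated with the sample schema from the answer.
mock_dtypes = [("data", "string"), ("Week", "string"), ("NumCCol1", "double")]
double_cols = [name for name, t in mock_dtypes if t == "double"]
print(double_cols)  # ['NumCCol1']

# With a live DataFrame `df`:
# df1 = df.select(*double_cols)          # keep only double columns
# df1 = df1.withColumnRenamed(old, new)  # then rename as needed
```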

Kishore