
I imported the data from a MongoDB database:

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri",
"mongodb://127.0.0.1/test.db").load()

I have selected the double columns using

double_list = [name for name,types in df.dtypes if types == 'double']
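In PySpark, df.dtypes returns a list of (column_name, type_string) tuples, so the filter above is plain Python and can be checked without a Spark session. The column names below are illustrative, not from the actual database:

```python
# df.dtypes yields (column_name, type_string) pairs; simulated here
# with made-up names so the filter can run without Spark.
dtypes = [("id", "string"), ("price", "double"), ("score", "double")]

# keep only the names whose type string is 'double'
double_list = [name for name, types in dtypes if types == "double"]
print(double_list)  # ['price', 'score']
```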

Credits to @Ramesh Maharjan. To remove special characters we use

removedSpecials = [''.join(y for y in x if y.isalnum()) for x in double_list]
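Since str.isalnum() keeps only letters and digits, this strips dots, spaces, underscores, and any other punctuation from the names. A quick check with made-up column names:

```python
# str.isalnum() is True only for letters and digits, so '.', ' ' and '_'
# are all dropped; these names are illustrative.
double_list = ["price.usd", "avg score", "net_total"]
removedSpecials = ["".join(y for y in x if y.isalnum()) for x in double_list]
print(removedSpecials)  # ['priceusd', 'avgscore', 'nettotal']
```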

The question is:

How can I create a new dataframe, based on df, with ONLY the double_list columns?

PolarBear10
  • @RameshMaharjan the other one is to just get the datatype of a column, this question is 'creating a new dataframe in pyspark based on the datatype of the column' – PolarBear10 Jul 06 '18 at 09:49
  • @RameshMaharjan no, I am afraid not. My question is how to remove special characters from column type double. I already know how to print the schema and find the double columns. The question is how to create a new dataframe, based on the columns double. – PolarBear10 Jul 06 '18 at 10:34
  • @RameshMaharjan No worries, I was not clear enough. I slightly modified the question. – PolarBear10 Jul 06 '18 at 11:15
  • so you just want to rename the columns that have double as datatype, with the special characters removed. Is that correct? – Ramesh Maharjan Jul 06 '18 at 11:19
  • @RameshMaharjan yes, you are on point ! – PolarBear10 Jul 06 '18 at 11:20

2 Answers


If you already have the list of column names with double as datatype, then the next step is to remove the special characters, which can be done using .isalnum() as in the question:

removedSpecials = [''.join(y for y in x if y.isalnum()) for x in double_list]

Once you have the list of column names with the special characters removed, it's just a .withColumnRenamed() API call:

for (x, y) in zip(double_list, removedSpecials):
    df = df.withColumnRenamed(x, y)

df.show(truncate=False) should give you the dataframe with the double-typed columns renamed.

If you don't want the columns that are not in double_list, i.e. the columns that are not of double datatype, then you can use select:

df.select(*removedSpecials).show(truncate=False)
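The rename-then-select flow above can be simulated in plain Python to see what happens to the names; a dict stands in for the DataFrame's columns, and the names and values are made up:

```python
# Pure-Python sketch of the rename-then-select flow (no Spark needed).
# The dict stands in for a DataFrame's columns; names are illustrative.
columns = {"price.usd": [1.0], "qty#sold": [2.0], "label": ["a"]}
double_list = ["price.usd", "qty#sold"]  # the double-typed columns

# clean the names, as in the answer
removedSpecials = ["".join(c for c in n if c.isalnum()) for n in double_list]

# "withColumnRenamed": map each old name to its cleaned name
rename_map = dict(zip(double_list, removedSpecials))
renamed = {rename_map.get(k, k): v for k, v in columns.items()}

# "select": keep only the cleaned double columns, dropping the rest
selected = {k: renamed[k] for k in removedSpecials}
print(selected)  # {'priceusd': [1.0], 'qtysold': [2.0]}
```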

The reason for renaming with

for (x, y) in zip(double_list, removedSpecials):
    df = df.withColumnRenamed(x, y)

before doing

df.select(*removedSpecials).show(truncate=False)

is that the original column names might contain special characters like ., which Spark interprets as a struct-field accessor unless the name is escaped with backticks. That prevents concise solutions like df.select([df[x].alias(y) for (x, y) in zip(double_list, removedSpecials)]).show(truncate=False) from working.

I hope the answer is helpful

Ramesh Maharjan
  • `df = df.withColumnRenamed(x,y)` will now also have columns that were NOT in the `double_list`. The idea is to create a new dataFrame with only the `double_list` columns. All other columns that are not in `double_list` are dropped. – PolarBear10 Jul 06 '18 at 11:32
  • so you don't want the columns that are not in double_list? – Ramesh Maharjan Jul 06 '18 at 11:32
  • I do not want the columns that are not in `double_list` – PolarBear10 Jul 06 '18 at 11:34
  • I am trying to pass `df.select(*removedSpecials)` to a new variable, but simply doing `df_new = df.select(*removedSpecials)` does not seem to do the trick. Is there a way to create a new independent copy variable from the original dataframe with only those columns? – PolarBear10 Jul 06 '18 at 13:25
  • actually, I misunderstood the error. Here it is **Caused by: org.apache.spark.SparkException: Values to assemble cannot be null** . The next step will be to drop all empty and blank columns from the `df.select(*removedSpecials)`. Phew ! gotta figure that out. – PolarBear10 Jul 06 '18 at 13:47
  • try it. If you can't, research on the web; you can find a lot of info on Stack Overflow itself. If all of that fails, you are always welcome to ask another question – Ramesh Maharjan Jul 06 '18 at 13:50

Scala code, which you can convert into Python:

import sqlContext.implicits._
// sample df
df.show()
+----+--------------------+--------+
|data|                Week|NumCCol1|
+----+--------------------+--------+
| aac|01/28/2018-02/03/...|     2.0|
| aac|02/04/2018-02/10/...|    23.0|
| aac|02/11/2018-02/17/...|   105.0|
+----+--------------------+--------+

df.printSchema()

root
 |-- data: string (nullable = true)
 |-- Week: string (nullable = true)
 |-- NumCCol1: double (nullable = false)

 val doubleCols = df.schema.fields
      .collect({ case x if x.dataType.typeName == "double" => x.name })

 // select via the String varargs overload so it works for any
 // number of double columns (a foldLeft of single-column selects
 // would fail as soon as there is more than one)
 val df1 = df.select(doubleCols.head, doubleCols.tail: _*)

 // df with only double columns
 df1.show()

Use df1.withColumnRenamed to rename the columns.
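A possible Python conversion of the Scala snippet above; the schema filter is plain Python (shown on a mock of df.dtypes so it runs without Spark), while the Spark-only calls are left as comments:

```python
# Hypothetical PySpark conversion of the Scala answer.
# df.dtypes is simulated with the sample schema from the answer.
mock_dtypes = [("data", "string"), ("Week", "string"), ("NumCCol1", "double")]
double_cols = [name for name, t in mock_dtypes if t == "double"]
print(double_cols)  # ['NumCCol1']

# With a live DataFrame `df`:
# df1 = df.select(*double_cols)          # keep only double columns
# df1 = df1.withColumnRenamed(old, new)  # then rename as needed
```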

Kishore