i am having error with the code in pyspark related with df.withcolumn

Question

Can I use the following code:

df.withColumn("id", df["id"].cast("integer")).na.drop(subset=["id"])

If id is not a valid integer, it will be NULL and dropped in the subsequent step.

Without changing the type

df = sqlContext.read.text("sample.txt") 
df.select(
    df.value.substr(1.2).alias('id'),
    df.value.substr(3.13).alias('name'),
    df.value.substr(16,8).alias('date'),
    df.value.substr(24,3).alias('Yes/No')
).show()

valid = df.where(df["id"].cast("integer").isNotNull())
invalid = df.where(df["id"].cast("integer").isNull())

Here my df.printschema prints

root
|-- value: string (nullable = true)

+---+-------------+--------+------+
| id| name          | date |Yes/No| 
+---+-------------+--------+------+ 
| 01|abcdefghijklkm |010V2201| 9Ye| 
+---+-------------+--------+------+
| ab| abcdefghijklmm|010V2201| 9Ye| 
+---+-------------+--------+------+

this is a sample output

Expected result row with integer column to be removed with null or invalid values, can i use df.withcolumn into it ? if i can then how ?

I think you're asking: "Do I need to convert to integer before dropping nulls?" It depends on what your desired output would be- converting a string to integer can introduce new `null` values if the string can not be converted. A small [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) with sample input and the desired output would be helpful in clearing up what you're asking, because it's unclear to me at the moment. — pault, May 08 '19 at 14:31
Hello Pault,this part of my code df = sqlContext.read.text("sample.txt") df.select( df.value.substr(1.2).alias('id'), df.value.substr(3.13).alias('name'), df.value.substr(16,8).alias('date'), df.value.substr(24,3).alias('Yes/No') ).show() this prints the txt file in Data frame can i use any method to make sure that the column id will skip printing rows where there are no integer values in my string, — Vinit T, May 09 '19 at 05:15
+---+-------------+--------+------+ | id| name| date|Yes/No| +---+-------------+--------+------+ | 01| abcdefghijklkm|010V2201| 9Ye| +---+-------------+--------+------+ | ab| abcdefghijklmm|010V2201| 9Ye| +---+-------------+--------+------+ this is a sample output, i want the id column to not print those rows where id is not an integer i would the id c — Vinit T, May 09 '19 at 05:24
Don't put code/data/output in the comments. [Edit] them into your question. — pault, May 09 '19 at 14:19
hello Pault, i have edited my question and put the sample output there — Vinit T, May 10 '19 at 05:18

i am having error with the code in pyspark related with df.withcolumn

0 Answers0