You can use the following code:
df.withColumn("id", df["id"].cast("integer")).na.drop(subset=["id"])
If id is not a valid integer, the cast returns NULL and the row is dropped in the subsequent step.
df = sqlContext.read.text("sample.txt")
# substr(startPos, length) takes two arguments and counts positions from 1
df.select(
    df.value.substr(1, 2).alias('id'),
    df.value.substr(3, 13).alias('name'),
    df.value.substr(16, 8).alias('date'),
    df.value.substr(24, 3).alias('Yes/No')
).show()
Without changing the type (here df is the DataFrame that already has the id column):
valid = df.where(df["id"].cast("integer").isNotNull())
invalid = df.where(df["id"].cast("integer").isNull())
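Since substr(startPos, length) counts positions from 1, the field offsets in the select can be sanity-checked outside Spark. The sketch below is plain Python, not PySpark. It assumes the intended layout is (1, 2), (3, 13), (16, 8), (24, 3); the sample line and the helper names slice_fields and is_contiguous are made up for illustration.

```python
# Field layout as (name, 1-based start, length), assumed from the substr calls.
LAYOUT = [("id", 1, 2), ("name", 3, 13), ("date", 16, 8), ("Yes/No", 24, 3)]


def is_contiguous(layout):
    """Check that each field starts exactly where the previous one ended."""
    expected_start = 1
    for _, start, length in layout:
        if start != expected_start:
            return False
        expected_start = start + length
    return True


def slice_fields(line, layout=LAYOUT):
    """Cut one fixed-width line into a dict, converting 1-based starts to 0-based slices."""
    return {name: line[start - 1:start - 1 + length] for name, start, length in layout}


# Hypothetical 26-character record: 2 + 13 + 8 + 3 characters.
sample = "01" + "abcdefghijklm" + "010V2201" + "9Ye"
row = slice_fields(sample)
print(is_contiguous(LAYOUT), row)
```

If is_contiguous returns False, two fields overlap or leave a gap, which usually points to a wrong start position or length in one of the substr calls.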
Here is what df.printSchema() prints:
root
 |-- value: string (nullable = true)
+---+--------------+--------+------+
| id|          name|    date|Yes/No|
+---+--------------+--------+------+
| 01|abcdefghijklkm|010V2201|   9Ye|
| ab|abcdefghijklmm|010V2201|   9Ye|
+---+--------------+--------+------+
This is a sample output.
Expected result: rows whose id column holds a null or non-integer value should be removed. Can I use df.withColumn for this? If so, how?
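The cast-then-drop idea (withColumn with a cast to integer, followed by na.drop on that column) can be pictured in plain Python as below. This is an analogy rather than Spark code: the helper to_int_or_none is hypothetical and only approximates a cast("integer") that returns NULL on bad input, the sample rows are made up, and Spark's own parsing rules differ in edge cases. Note that when ANSI mode is on (spark.sql.ansi.enabled), an invalid cast raises an error instead of returning NULL, so the pattern relies on ANSI mode being off.

```python
import re

# Optional sign and decimal digits, with optional surrounding whitespace.
_INT_PATTERN = re.compile(r"\s*[+-]?\d+\s*")


def to_int_or_none(text):
    """Return int(text) for a plain decimal integer string, else None (stand-in for NULL)."""
    if text is None or not _INT_PATTERN.fullmatch(text):
        return None
    return int(text)


# Made-up rows standing in for the parsed DataFrame.
rows = [
    {"id": "01", "name": "first"},
    {"id": "ab", "name": "second"},
    {"id": None, "name": "third"},
]

# Step 1: "cast" the id column; invalid values become None.
casted = [{**r, "id": to_int_or_none(r["id"])} for r in rows]

# Step 2: drop every row whose id is None.
kept = [r for r in casted if r["id"] is not None]
print(kept)
```

Only the row with id "01" survives, and its id is now the integer 1. Filtering with isNotNull() on the cast instead keeps the original string value in the column.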