19

I have a PySpark DataFrame with a column of strings. How can I check which rows in it are numeric? I could not find any such function in PySpark's official documentation.

values = [('25q36',),('75647',),('13864',),('8758K',),('07645',)]
df = sqlContext.createDataFrame(values,['ID',])
df.show()
+-----+
|   ID|
+-----+
|25q36|
|75647|
|13864|
|8758K|
|07645|
+-----+

In Python, strings have a method .isdigit() which returns True if the string contains only digits and False otherwise.
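
For example, the plain-Python behaviour I would like to reproduce for each row looks like this:

'75647'.isdigit()   # True  - the string contains only digits
'25q36'.isdigit()   # False - the string contains a letter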

Expected DataFrame:

+-----+-------+
|   ID| Value |
+-----+-------+
|25q36| False |
|75647| True  |
|13864| True  |
|8758K| False |
|07645| True  |
+-----+-------+

I would like to avoid creating a UDF.

cph_sto

7 Answers

28

A simple cast will do the job: casting the column to int returns null for any string that cannot be parsed as an integer, so isNotNull() gives exactly the boolean you want.

from pyspark.sql import functions as F

df.select(
  "ID",
  F.col("ID").cast("int").isNotNull().alias("Value")
).show()

+-----+-----+
|   ID|Value|
+-----+-----+
|25q36|false|
|75647| true|
|13864| true|
|8758K|false|
|07645| true|
+-----+-----+
Steven
  • Thanks Steven. This definitely works. I thought there might be some inbuilt function as well. If I don't find one, I will accept this one as the answer. – cph_sto Dec 12 '18 at 14:04
  • **Pay Attention!** It's better to use `long` rather than `int`, because the integer range is only -2,147,483,648 to 2,147,483,647. – Mohammad-Reza Malekpour Aug 17 '21 at 12:12
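
Following the comment above, a minimal sketch of the same check with a long cast (so that IDs with more than about 10 digits are not misclassified as non-numeric) might look like this:

from pyspark.sql import functions as F

df.select(
  "ID",
  F.col("ID").cast("long").isNotNull().alias("Value")
).show()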
5

I agree with @Steven's answer, but here is a slight modification for the case where the whole table should be filtered down to the numeric rows:

df2.filter(F.col("id").cast("int").isNotNull()).show()

Also, there is no need to create a new column called Value.


An alternative solution, similar to the above, is (display is available in Databricks notebooks; use .show() elsewhere):

display(df2.filter("CAST(id AS INT) IS NOT NULL"))
Mohseen Mulla
5

Filtering with Regex

I enjoyed the creative solution provided by Steven, but here is a much simpler suggestion for this kind of situation:

df.filter(~df.ID.rlike(r'\D+')).show()

First, rlike(r'\D+') matches every row whose ID contains a non-digit character, and the ~ at the beginning of the filter then excludes those rows.
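
Equivalently, a small sketch using the positive pattern instead (assuming the df from the question), which keeps only the rows whose ID consists entirely of digits:

df.filter(df.ID.rlike(r'^[0-9]+$')).show()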

2

The clearest way to search for non-numeric rows would be something like this:

from pyspark.sql import functions as F
from pyspark.sql.functions import col

df.select("col_a", F.regexp_replace(col("col_a"), "[^0-9]", "").alias("numeric"))\
    .filter(col("col_a") != col("numeric"))\
    .distinct()\
    .show()
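
The same idea can also produce the boolean Value column from the question instead of filtering; a minimal sketch (reusing the imports above and the ID column from the question's df, and noting that an empty string would be reported as numeric here):

df.select(
    "ID",
    (col("ID") == F.regexp_replace(col("ID"), "[^0-9]", "")).alias("Value")
).show()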
Bramvd3
1

If you want, you can also build a custom UDF for this purpose:

from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf
from pyspark.sql import functions as F

def is_digit(val):
    # Guard against null values before calling isdigit()
    if val:
        return val.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())

df = df.withColumn('Value', F.when(is_digit_udf(F.col('ID')), F.lit(True)).otherwise(F.lit(False)))
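
Since is_digit_udf already returns a boolean, the when/otherwise wrapper is not strictly needed; a simpler sketch of the same column would be:

df = df.withColumn('Value', is_digit_udf(F.col('ID')))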
Manrique
1
from pyspark.sql.functions import col, lit, when

df = spark.read.option("header", "true").csv("source_table.csv")
df = df.withColumn("is_valid", lit("true"))

# Cast-based check: the cast yields null for non-numeric values, so this will work
df.withColumn("is_valid", when(col("age").cast("int").isNotNull(), col("is_valid")).otherwise("false")).show()

# If you want to use rlike, this will also work
pattern = "^[0-9]*$"
source_df = df.withColumn("is_valid",
                          when(col("age").rlike(pattern), col("is_valid")).otherwise("false"))
shivar
0

Try this; it is in the Scala language:

spark.udf.register("IsNumeric", (inpColumn: String) => inpColumn.nonEmpty && inpColumn.forall(_.isDigit))
spark.sql(s""" select "ABCD", IsNumeric("1234") as IsNumeric_1 """).show(false)
Andy Quiroz