
I have a really big table representing points (>30 million points). It can have two or three columns representing x, y, z.

Unfortunately, some of these columns can contain strings ('nan', 'nulo', 'vazio', etc.); they can change from file to file but are constant within a table.

I need a way to remove these strings, either replacing them with nulls or removing the whole row.

What I did is in the code below. Is there a better way? Something more flexible? (This code only works for 3D.)

from pyspark.sql.functions import regexp_replace

def import_file(self, file_path: str, sep: str = ',', null_values: str = ''):

    # read the table (toDF('x','y','z') only works when there are exactly three columns)
    table = self.spark.read.load(path=file_path,
                                 format='csv',
                                 sep=sep,
                                 header=False).toDF('x', 'y', 'z')

    # strip letters from the values; withColumn returns a new DataFrame,
    # so the result has to be assigned back
    table = table.withColumn('x', regexp_replace('x', '[a-z]', ''))
    table = table.withColumn('y', regexp_replace('y', '[a-z]', ''))
    table = table.withColumn('z', regexp_replace('z', '[a-z]', ''))

    # replace '' with nulls or TODO: remove those rows
    table = table.replace('', None)

    return table
VVT

3 Answers


Another way is to use a UDF to mark the string values; then, based on whatever combination of columns you want to check, you can easily drop those rows.

import pyspark.sql.functions as F
from pyspark.sql.types import BooleanType, StringType
import pandas as pd

@F.udf(returnType=BooleanType())
def mark_strings(inp):

    # check whether inp is a string; numeric rows are assumed possible as well
    # and are returned as-is (marked False)
    if isinstance(inp, str) and not pd.isnull(inp):
        if inp.isalpha():
            return True

    return False


@F.udf(returnType=StringType())
def replace_strings(inp):

    # check whether inp is a string; numeric rows are assumed possible as well
    # and are returned as-is
    if isinstance(inp, str) and not pd.isnull(inp):
        if inp.isalpha():
            return None  # a StringType column represents a missing value as None

    return inp

Dropping Data Rows

table = table.withColumn('x_str_bool',mark_strings(F.col('x')))
table = table.withColumn('y_str_bool',mark_strings(F.col('y')))
table = table.withColumn('z_str_bool',mark_strings(F.col('z')))

##### Assuming you only want to remove string data rows based on a combination of x and y.

table_filter = table.filter((F.col('x_str_bool') == False) &
                            (F.col('y_str_bool') == False))
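
The helper boolean columns are only needed for the filter; after that they can be dropped (a small follow-up sketch, assuming the same column names as above):

# the *_str_bool marker columns have served their purpose, so drop them
table_filter = table_filter.drop('x_str_bool', 'y_str_bool', 'z_str_bool')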

Replacing Data Rows

table = table.withColumn('x',replace_strings(F.col('x')))
table = table.withColumn('y',replace_strings(F.col('y')))
table = table.withColumn('z',replace_strings(F.col('z')))
Vaebhav
  • I would expect that a UDF (especially when written in Python) is always slower than a built-in SQL function like `regexp_replace`. – werner Jun 05 '21 at 11:37
  • @werner - I totally agree, `regexp_replace` is superior. The answer is not about optimization, but rather another way of doing it using a UDF – Vaebhav Jun 05 '21 at 11:51
  • @werner why? Maybe I am wrong, and I am really a noob with Hadoop and Spark, but since Spark uses map-reduce, why can you assume UDFs are slower than SQL functions? – VVT Jun 05 '21 at 18:20
  • 1
    @VvT please have a look at [this answer](https://stackoverflow.com/a/38297050/2129801). I think it could be helpful – werner Jun 05 '21 at 20:52

You could use the answer here to remove the rows you cannot cast to integer; that way you don't need a UDF: how to check if a string column in pyspark dataframe is all numeric

It would be something like this:

table = table.filter(col("x").cast("int").isNotNull())
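
A minimal sketch extending that one-liner to every coordinate column (assuming the columns are named x, y and z, and that a row should be dropped as soon as any value fails the cast):

from pyspark.sql.functions import col

# keep only rows where every coordinate casts cleanly to a number
for c in ['x', 'y', 'z']:
    table = table.filter(col(c).cast('double').isNotNull())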
jrip

If you expect only numbers (assuming we're just talking about a point's location here), then you can cast the whole column to integer/double; any value that is not a number becomes null.

import pyspark.sql.functions as F

def import_file(self, file_path: str, sep: str = ',', null_values: str = ''):

    # read table
    # ...

    for c in table.columns:
        table = table.withColumn(c, F.col(c).cast('integer'))  # or 'double'

    return table
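
If you would rather drop those rows than keep the nulls (the question allows either), a small follow-up sketch, assuming the columns are named x, y, z as in the question:

# drop any row where the cast produced a null in one of the coordinate columns
table = table.dropna(subset=['x', 'y', 'z'])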
pltc