I have a really big table representing points (>30 million points). It can have two or three columns representing x, y, z.
Unfortunately, some of these columns can contain strings ('nan', 'nulo', 'vazio', etc.); they change from file to file but are constant within a single table.
I need a way to remove these strings, either replacing them with nulls or dropping the rows.
What I did is shown in the picture and in the code below. Is there a better way? Something more flexible? (This code only works for 3D.)
from pyspark.sql.functions import regexp_replace

def import_file(self, file_path: str, sep: str = ',', null_values: str = ''):
    # read the CSV into a DataFrame and name the columns
    table = self.spark.read.load(path=file_path,
                                 format='csv',
                                 sep=sep,
                                 header=False).toDF('x', 'y', 'z')
    # strip letters from every column; withColumn returns a new
    # DataFrame, so the result has to be reassigned
    table = table.withColumn('x', regexp_replace('x', '[a-z]', ''))
    table = table.withColumn('y', regexp_replace('y', '[a-z]', ''))
    table = table.withColumn('z', regexp_replace('z', '[a-z]', ''))
    # replace the now-empty strings with nulls (TODO: or drop the rows)
    table = table.replace('', None)
    return table
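
For context, this is the kind of direction I was wondering about: letting the CSV reader map the known bad token to null via its nullValue option, then casting to double so anything else non-numeric also becomes null. This is only a sketch under my assumptions (the bad token is constant per file and passed in through null_values; also, I believe Spark 3 casts the literal 'NaN' to a NaN double rather than to null, so that particular token has to go through nullValue):

from pyspark.sql.functions import col

def import_file(self, file_path: str, sep: str = ',', null_values: str = 'nan'):
    # let the CSV reader turn the known bad token into null while parsing
    table = self.spark.read.csv(file_path, sep=sep, header=False,
                                nullValue=null_values)
    # name only as many columns as the file actually has (2D or 3D)
    table = table.toDF(*['x', 'y', 'z'][:len(table.columns)])
    # casting to double nulls out any remaining non-numeric token,
    # whatever it happens to be in this file
    table = table.select([col(c).cast('double').alias(c) for c in table.columns])
    # drop the rows that contained a bad value (or keep the nulls instead)
    return table.dropna()

This would also return proper numeric columns instead of strings, which my regex version doesn't. Is something like this the right track?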