I have this PySpark code, to which I pass a list of column index values. I want to select the columns of a CSV file corresponding to those indexes:
import sys
from pyspark import SparkContext

def ml_test(input_col_index):
    sc = SparkContext(master='local', appName='test')
    # pair each line with its row number, keep every row, then drop the index again
    input_data = sc.textFile('hdfs://localhost:/dir1').zipWithIndex() \
                   .filter(lambda pair: pair[1] >= 0).map(lambda pair: pair[0])
if __name__ == '__main__':
    input_col_index = sys.argv[1]  # for example: ['1','2','3','4']
    ml_test(input_col_index)
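As an aside, sys.argv[1] arrives as a single string rather than a list, so I would need to parse it first. A minimal sketch of that step, assuming the list-literal format shown in the comment above:

import ast

# hypothetical parsing step: "['1','2','3','4']" -> [1, 2, 3, 4]
input_col_index = [int(i) for i in ast.literal_eval(sys.argv[1])]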
If I had a static or hardcoded set of columns to select from the above CSV file, I could do that, but here the indexes of the desired columns are passed in as a parameter. I also have to calculate the number of distinct values in each selected column, which I know can be done for a single known column with column_1 = input_data.map(lambda x: x[0]).distinct().collect() and then taking len(column_1). But how do I do this for a set of columns that is not known in advance and is determined by the index list passed at runtime?
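What I have in mind is something along these lines (a rough sketch, assuming the file is comma-separated; parsed and distinct_counts are placeholder names), but I am not sure it is the right approach:

# split each CSV line into fields, then count distinct values per requested index
parsed = input_data.map(lambda line: line.split(','))
distinct_counts = {}
for i in input_col_index:
    # the default argument binds the current value of i into each lambda
    distinct_counts[i] = parsed.map(lambda row, i=i: row[i]).distinct().count()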
NOTE: I have to calculate the number of distinct values per column because I have to pass those counts as a parameter to PySpark's RandomForest algorithm.
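For context, here is roughly where those counts would end up: the categoricalFeaturesInfo argument of pyspark.mllib.tree.RandomForest maps each feature position to its number of categories. In this sketch, labeled_data, numClasses=2, and numTrees=10 are placeholders:

from pyspark.mllib.tree import RandomForest

# hypothetical wiring: feature j of each training vector comes from column input_col_index[j],
# and labeled_data is an RDD of LabeledPoint built from the selected columns
categorical_info = {j: distinct_counts[i] for j, i in enumerate(input_col_index)}
model = RandomForest.trainClassifier(labeled_data, numClasses=2,
                                     categoricalFeaturesInfo=categorical_info,
                                     numTrees=10)  # note: maxBins must cover the largest category count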