I have a small problem. I would like to delete any row that contains 'NULL'.

This is my input file:

matricule,dateins,cycle,specialite,bourse,sport
0000000001,1999-11-22,Master,IC,Non,Non
0000000002,2014-02-01,Null,IC,Null,Oui
0000000003,2006-09-07,Null,Null,Oui,Oui
0000000004,2008-12-11,Master,IC,Oui,Oui
0000000005,2006-06-07,Master,SI,Non,Oui

I did a lot of research and found a function called drop(any), which drops any row that contains a NULL value. I tried using it in the code below, but it doesn't work:

val x = sc.textFile("/home/amel/one")

val re = x.map(row => {
  val cols = row.split(",")
  val cycle = cols(2)
  val years = cycle match {
    case "License" => "3 years"
    case "Master" => "3 years"
    case "Ingeniorat" => "5 years"
    case "Doctorate" => "3 years"
    case _ => "other"
  }
  (cols(1).split("-")(0) + "," + years + "," + cycle + "," + cols(3), 1)
}).reduceByKey(_ + _)
re.collect.foreach(println)

This is the current result of my code:

(1999,3 years,Master,IC,57)
(2013,NULL,Doctorat,SI,44)
(2013,NULL,Licence,IC,73)
(2009,5 years,Ingeniorat,Null,58)
(2011,3 years,Master,Null,61)
(2003,5 years,Ingeniorat,Null,65)
(2019,NULL,Doctorat,SI,80)

However, I want the result to be like this:

(1999, 3 years, Master, IC)

I.e., any row that contains 'NULL' should be removed.

Jonathan Myers
Amel ha

1 Answer

This is similar to, but not a duplicate of, the following question on SO: Filter spark DataFrame on string contains

You can filter this RDD when you read it in.

val x = sc.textFile("/home/amel/one").filter(!_.toLowerCase.contains("null"))
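
The substring check above also drops a row if "null" merely appears inside a longer value. If that matters, a field-level check may be safer. The following is a minimal sketch, not a definitive implementation: it uses plain Scala collections on the sample rows from the question so it runs without a cluster, the helper name hasNullField is hypothetical, and skipping the header line is an assumption about what the output should contain.

```scala
// Sample rows copied from the question's input file.
val lines = Seq(
  "matricule,dateins,cycle,specialite,bourse,sport",
  "0000000001,1999-11-22,Master,IC,Non,Non",
  "0000000002,2014-02-01,Null,IC,Null,Oui",
  "0000000003,2006-09-07,Null,Null,Oui,Oui",
  "0000000004,2008-12-11,Master,IC,Oui,Oui",
  "0000000005,2006-06-07,Master,SI,Non,Oui"
)

// Hypothetical helper: true when any comma-separated field is exactly
// "null", ignoring case and surrounding whitespace.
// The -1 limit keeps trailing empty fields instead of discarding them.
def hasNullField(row: String): Boolean =
  row.split(",", -1).exists(_.trim.equalsIgnoreCase("null"))

// Assumption: the first line is a header and should not be processed.
val header = lines.head

// Keep only data rows in which no field is "null".
val kept = lines.filter(row => row != header && !hasNullField(row))

kept.foreach(println)
```

In spark-shell, the same predicate could be passed to the RDD directly, for example sc.textFile("/home/amel/one").filter(row => !hasNullField(row)), before the map and reduceByKey steps.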
Jonathan Myers
  • Hi again. You have "Null" listed in the data. But you have "NULL" listed as the requirement. I have changed the code so that it is case-insensitive. – Jonathan Myers Jun 26 '19 at 15:29
  • Hey! Yes, I deleted my comment after noticing the small mistake I made. Thank you – Amel ha Jun 26 '19 at 15:38