
I am opening several CSV files in Spark 2.2, but when I do a `count` it returns 10000000 records when in reality there are 6000000. When I check the same files with Pandas in Python or with Alteryx, I get the correct number.

  scala> val df=spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("encoding", "UTF-8").load("/detalle/*.csv")
  df: org.apache.spark.sql.DataFrame = [KEY: string, UNIQ: string ... 101 more fields]

  scala> df.count
  res13: Long = 10093371
Mat.cort
    It is likely that your data contains embedded newline characters. – 10465355 Nov 26 '18 at 20:50
    I introduced the code below. In version 2.3 the difference is only about 1000 records, but in version 2.2 the large difference remains. `val df = spark.read.option("wholeFile", true).option("multiline",true).option("header", true).option("inferSchema", "true").option("delimiter", ",").option("mode", "DROPMALFORMED").csv("/detalle/*.csv")` – Mat.cort Nov 26 '18 at 23:40
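The embedded-newline hypothesis from the comments can be illustrated without Spark. A quoted CSV field may legally contain a newline; any parser that splits the file on raw line breaks (which is effectively what Spark's CSV reader does without `multiline`) counts that one record as two. A minimal sketch in Python's standard `csv` module, with made-up sample data:

```python
import csv
import io

# Two real records; the first has a newline embedded in a quoted field.
raw = 'KEY,UNIQ\n"k1","line one\nline two"\n"k2","plain"\n'

# Naive line-based count: splits inside the quoted field too.
naive_count = raw.count("\n") - 1  # minus the header line
print(naive_count)  # 3 "records"

# A quote-aware CSV parser keeps the quoted newline inside one record.
rows = list(csv.reader(io.StringIO(raw)))
record_count = len(rows) - 1  # minus the header row
print(record_count)  # 2 real records
```

This is the same mismatch pattern as the question: the line-based count overshoots, and a multiline-aware read recovers the true record count.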

1 Answer


After a lot of searching and testing, I found the answer in this post:

Reading csv files with quoted fields containing embedded commas

The final line ended up as follows:

  val df = spark.read.format("com.databricks.spark.csv")
    .option("wholeFile", true)
    .option("multiline", true)
    .option("header", true)
    .option("inferSchema", "true")
    .option("delimiter", ",")
    .option("encoding", "ISO-8859-1")
    .option("charset", "ISO-8859-1")
    .option("quote", "\"")
    .option("escape", "\"")
    .load("*.csv")

Thanks!
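Setting both `quote` and `escape` to `"` tells the reader that a literal quote inside a quoted field is written as two quotes in a row (the RFC 4180 convention), so commas and newlines inside such fields no longer split records. A small sketch of that same convention using Python's standard `csv` module (the sample data is made up):

```python
import csv
import io

# A doubled quote ("") inside a quoted field represents one literal quote.
raw = 'KEY,TXT\n"k1","he said ""hi"", then left"\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])  # ['k1', 'he said "hi", then left'] -- one record, comma preserved
```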

Mat.cort
    [so] is an English-only site. Please post in English. [See here](https://meta.stackexchange.com/q/13676/204869), [here](https://meta.stackoverflow.com/a/262054/1402846), and [here](https://blog.stackoverflow.com/2009/07/non-english-question-policy) for details. Thank you. – Pang Nov 27 '18 at 01:37