Questions tagged [spark-csv]

A library for handling CSV files in Apache Spark.

139 questions
316 votes · 17 answers

How to show full column content in a Spark Dataframe?

I am using spark-csv to load data into a DataFrame. I want to do a simple query and display the content: val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("my.csv") df.registerTempTable("tasks") results =…
tracer
170 votes · 16 answers

Write single CSV file using spark-csv

I am using https://github.com/databricks/spark-csv and trying to write a single CSV file, but I am not able to: Spark creates a folder instead. I need a Scala function that takes a path and a file name as parameters and writes a single CSV file.
user1735076
84 votes · 13 answers

Provide schema while reading csv file as a dataframe in Scala Spark

I am trying to read a csv file into a dataframe. I know what the schema of my dataframe should be since I know my csv file. Also I am using the spark-csv package to read the file. I am trying to specify the schema like below. val pagecount =…
Pa1
25 votes · 2 answers

How to estimate dataframe real size in pyspark?

How to determine a dataframe size? Right now I estimate the real size of a dataframe as follows: headers_size = key for key in df.first().asDict() rows_size = df.map(lambda row: len(value for key, value in row.asDict()).sum() total_size =…
TheSilence
20 votes · 7 answers

How to read only n rows of large CSV file on HDFS using spark-csv package?

I have a big distributed file on HDFS and each time I use sqlContext with spark-csv package, it first loads the entire file which takes quite some time. df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',…
Abhishek
15 votes · 2 answers

How to parse a csv that uses ^A (i.e. \001) as the delimiter with spark-csv?

Terribly new to spark and hive and big data and scala and all. I'm trying to write a simple function that takes an sqlContext, loads a csv file from s3 and returns a DataFrame. The problem is that this particular csv uses the ^A (i.e. \001)…
user2535982
14 votes · 3 answers

Can I read a CSV represented as a string into Apache Spark using spark-csv

I know how to read a csv file into spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the csv file represented as a string and would like to convert this string directly to dataframe. Is this possible?
Gary Sharpe
13 votes · 1 answer

inferSchema in spark-csv package

When a CSV is read as a dataframe in Spark, all the columns are read as strings. Is there any way to get the actual type of each column? I have the following csv file Name,Department,years_of_experience,DOB Sam,Software,5,1990-10-10 Alex,Data…
sag
9 votes · 2 answers

Spark fails to read CSV when last column name contains spaces

I have a CSV that looks like this: +-----------------+-----------------+-----------------+ | Column One | Column Two | Column Three | +-----------------+-----------------+-----------------+ | This is a value | This is a value | This is…
Sam Malayek
8 votes · 2 answers

How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?

I use Spark 2.2.0 I am reading a csv file as follows: val dataFrame = spark.read.option("inferSchema", "true") .option("header", true) .option("dateFormat", "yyyyMMdd") …
Rami
8 votes · 1 answer

Spark schema from case class with correct nullability

For a custom Estimator's transformSchema method I need to be able to compare the schema of an input data frame to the schema defined in a case class. Usually this could be performed like Generate a Spark StructType / Schema from a case class as…
8 votes · 2 answers

Getting NullPointerException using spark-csv with DataFrames

Running through the spark-csv README there's sample Java code like this import org.apache.spark.sql.SQLContext; import org.apache.spark.sql.types.*; SQLContext sqlContext = new SQLContext(sc); StructType customSchema = new StructType( new…
Dennis Huo
7 votes · 1 answer

Add UUID to spark dataset

I am trying to add a UUID column to my dataset. getDataset(Transaction.class)).withColumn("uniqueId", functions.lit(UUID.randomUUID().toString())).show(false); But the result is that all the rows have the same UUID. How can I make it…
Adiant
7 votes · 3 answers

Spark DataFrame handing empty String in OneHotEncoder

I am importing a CSV file (using spark-csv) into a DataFrame which has empty String values. When I apply the OneHotEncoder, the application crashes with the error requirement failed: Cannot have an empty string for name. Is there a way I can get around…
6 votes · 3 answers

Is there an explanation when spark-csv won't save a DataFrame to file?

dataFrame.coalesce(1).write().save("path") sometimes writes only _SUCCESS and ._SUCCESS.crc files without the expected *.csv.gz, even for a non-empty input DataFrame. File save code: private static void writeCsvToDirectory(Dataset dataFrame, Path…
Makrushin Evgenii