
I would like to read a file with the following structure into Apache Spark.

628344092\t20070220\t200702\t2007\t2007.1370

The delimiter is \t. How can I do this with spark.read.csv()?

The CSV is much too big for pandas, which takes ages to read the file. Is there a way that works similarly to

pandas.read_csv(file, sep = '\t')

Thanks a lot!


3 Answers


Use spark.read.option("delimiter", "\t").csv(file), or use the option sep instead of delimiter.

If your delimiter is literally the two characters \t rather than a tab character, escape the backslash: spark.read.option("delimiter", "\\t").csv(file)
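A fuller sketch of the same approach, assuming a SparkSession named spark and a header-less file like the one in the question (the path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tsv-example").getOrCreate()

# Read a tab-delimited file; with no header row, Spark assigns
# default column names (_c0, _c1, ...).
df = (spark.read
      .option("delimiter", "\t")
      .option("inferSchema", "true")  # infer numeric types instead of reading everything as strings
      .csv("path/to/file.tsv"))

df.show()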


This works for me and is much clearer (to me). As you mentioned, in pandas you would do:

df_pandas = pandas.read_csv(file_path, sep = '\t')

In Spark:

df_spark = spark.read.csv(file_path, sep='\t', header=True)

Please note that if the first row of your csv does not contain the column names, you should set header = False, like this:

df_spark = spark.read.csv(file_path, sep='\t', header=False)

You can change the separator (sep) to fit your data.
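Since the file in the question has no header row, you can also name the columns yourself after reading; a minimal sketch (the column names below are made up for illustration):

df_spark = spark.read.csv(file_path, sep='\t', header=False)

# Rename the default _c0.._c4 columns; pick names that fit your data.
df_spark = df_spark.toDF("id", "date", "year_month", "year", "value")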


If you are using SparkSQL, you can use the DDL below with the OPTIONS clause to specify your delimiter.

CREATE TABLE sample_table
USING CSV
OPTIONS ('delimiter'='\t')
AS SELECT ...

SparkSQL Documentation
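If you want to issue such DDL from PySpark rather than a SQL shell, here is a minimal sketch (the table name and path are placeholders, and it uses a path-backed table instead of AS SELECT):

# Raw string so Python passes the literal \t through;
# Spark SQL's string literal parsing then interprets \t as a tab.
spark.sql(r"""
CREATE TABLE sample_table
USING CSV
OPTIONS ('delimiter' = '\t', 'path' = '/data/sample.tsv')
""")

spark.sql("SELECT * FROM sample_table").show()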