In Spark 1.6, how to read a CSV file with duplicated column names

Question

I can't find a way to read a CSV file in which a column name appears twice. When I read the file, Spark fails with an error complaining about duplicate column names.

Is there a way to handle this in Spark without altering the CSV file?

My CSV data looks like this. It is delimited by tabs (\t), and each column contains some extra spaces.

col1    col2  col3
  2020  100   sometext

@RameshMaharjan, if I provide a custom schema, it complains about data validation errors. Any idea why? — serverliving.com, Jul 03 '18 at 05:42
Check this: https://stackoverflow.com/questions/33816481/duplicate-columns-in-spark-dataframe — Kishore, Jul 03 '18 at 06:01
Something like this, even though the data types are correct. I guess it is due to the spaces --> Caused by: java.lang.NumberFormatException: For input string: " 20511" — serverliving.com, Jul 03 '18 at 06:01
Check out https://stackoverflow.com/questions/47021073/spark-sql-removing-white-spaces for dealing with such space issues. — Ramesh Maharjan, Jul 03 '18 at 06:10
@RameshMaharjan I tried those 2 options to ignore leading/trailing whitespace. Still the same NumberFormatException. — serverliving.com, Jul 03 '18 at 06:17
Please update the question with input samples, what you have tried, and the error message. — Ramesh Maharjan, Jul 03 '18 at 06:47

score 1 · Answer 1 · answered Jul 03 '18 at 06:06

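You can also try reading the CSV file as plain text with the textFile method, splitting each line on the delimiter and mapping the fields back into rows. You can then keep working with the result as an RDD or convert it to a DataFrame with column names of your own choosing!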
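A minimal PySpark sketch of that idea, assuming an existing SparkContext `sc` and SQLContext `sqlContext` (as in a Spark 1.6 shell) and a header on the first line. The helper names `dedupe_columns`, `split_fields` and `read_tsv`, and the `_1`, `_2` suffix scheme, are made up for illustration and are not part of Spark:

```python
def dedupe_columns(names):
    """Strip whitespace and make repeated names unique (col1, col1_1, ...)."""
    used = set()
    result = []
    for raw in names:
        base = raw.strip()
        candidate, n = base, 0
        # Keep bumping the suffix until the name is not taken yet.
        while candidate in used:
            n += 1
            candidate = "%s_%d" % (base, n)
        used.add(candidate)
        result.append(candidate)
    return result


def split_fields(line, sep="\t"):
    """Split one line on the delimiter and trim the extra spaces per field."""
    return [field.strip() for field in line.split(sep)]


def read_tsv(sc, sqlContext, path, sep="\t"):
    """Read a delimited text file into a DataFrame of string columns.

    Assumption: `sc` is a SparkContext and `sqlContext` a SQLContext.
    """
    lines = sc.textFile(path)
    header = lines.first()
    columns = dedupe_columns(header.split(sep))
    # Note: this also drops any data line that is identical to the header.
    rows = (lines.filter(lambda line: line != header)
                 .map(lambda line: tuple(split_fields(line, sep))))
    return sqlContext.createDataFrame(rows, columns)
```

Every column comes back as a string here, so numeric columns would still need an explicit cast afterwards, for example `df.withColumn("col1", df["col1"].cast("int"))`. Because the fields are trimmed first, a value such as " 20511" should no longer carry its leading space into that cast.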

Hope this works!

Vihit Shah
