I am trying to import an unstructured csv from Data Lake storage into Databricks, and I want to read the entire content of this file: sample.csv

EdgeMaster                          
Name    Value   Unit    Status  Nom.    Lower   Upper   Description
Type    A                                                  A
Date    1/1/2022                                           B
Time    0:00:00                                            A
X       1       m       OK      1       2       3          B
Y       -                                                  A
                            
EdgeMaster                          
Name    Value   Unit    Status  Nom.    Lower   Upper   Description
Type    B                                                  C
Date    1/1/2022                                           D
Time    0:00:00                                            C
X       1       m       OK      1       2       3          D
Y       -                                                  C

1. Method 1: I tried reading the first line as the header

df = (sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")  # treat the first line of the file as the header
      .load('abfss://xyz/sample.csv'))

I get only this: Output1

2. Method 2: I skipped reading the header
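
A sketch of that read, assuming the same options as Method 1 but with the header disabled; with no header, Spark assigns generic column names (`_c0`, `_c1`, ...):

df = (sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "false")  # do not treat the first line as a header
      .load('abfss://xyz/sample.csv'))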

No improvement: Output2

3. Method 3: I defined a custom schema
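
A sketch of that attempt; the exact schema isn't shown in the question, so the field names below are illustrative, taken from the file's repeated header row:

from pyspark.sql.types import StructType, StructField, StringType

# Illustrative schema: eight string columns matching the header row of sample.csv
schema = StructType([StructField(c, StringType(), True)
                     for c in ["Name", "Value", "Unit", "Status",
                               "Nom", "Lower", "Upper", "Description"]])

df = (sqlContext.read.format("com.databricks.spark.csv")
      .schema(schema)
      .load('abfss://xyz/sample.csv'))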

The query returns no result: Output3

Ankit Sawa
  • is that a csv or an excel file? – Steven Oct 28 '22 at 08:05
  • @Steven it's a csv file – Ankit Sawa Oct 28 '22 at 10:11
  • then please, show the input data as text so that anybody can simply copy/paste it instead of transcribing your image. [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) – Steven Oct 28 '22 at 12:49
  • @Steven I have made the suggested edits – Ankit Sawa Oct 28 '22 at 14:50
  • the problem is that your data is not a csv. You can write that in a flat file and name it `my_file.csv`; that does not make it a csv file. CSV stands for comma-separated values. It can have a header or not, values are separated with `,` (comma) in the international format, and strings are generally quoted. But currently, your file is not a csv, so it is normal that Spark cannot read it. It does not respect csv standards. You're just trying to read an unstructured flat file. – Steven Oct 28 '22 at 15:11
  • @Steven I played with the encoding (UTF-16) and the delimiter (`'\t'`) and was able to parse the file now (see the sketch below). – Ankit Sawa Oct 31 '22 at 05:47
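
For reference, a minimal sketch of the read described in that last comment, assuming the file is UTF-16 encoded and tab-delimited (the path is the question's placeholder):

df = (spark.read
      .option("encoding", "UTF-16")  # file bytes are UTF-16, not the default UTF-8
      .option("sep", "\t")           # fields are separated by tabs, not commas
      .csv("abfss://xyz/sample.csv"))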

1 Answer


If you know the schema ahead of time, it should be possible to read the CSV file and drop the malformed data.

See this as an example:

name_age.csv

Hello
name,age
aj,19
Hello
name,age
test,20

And the code to read this would be:

>>> from pyspark.sql.types import StringType, IntegerType, StructField, StructType
>>> schema = StructType([StructField("name", StringType(), True),
...                      StructField("age", IntegerType(), True)])
>>> # DROPMALFORMED discards rows that do not fit the schema (the "Hello" lines)
>>> df = spark.read.csv("name_age.csv", sep=",", mode="DROPMALFORMED", schema=schema)
>>> df.show()
+----+---+
|name|age|
+----+---+
|  aj| 19|
|test| 20|
+----+---+

Other helpful link: Remove first and last row from the text file in pyspark
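
For completeness, a sketch of the technique that link covers, applied to the same `name_age.csv`: pair each line with its index via `zipWithIndex`, then filter out the first and last lines before parsing:

>>> rdd = spark.sparkContext.textFile("name_age.csv")
>>> n = rdd.count()
>>> # keep every line except the first (index 0) and the last (index n - 1)
>>> trimmed = (rdd.zipWithIndex()
...               .filter(lambda pair: 0 < pair[1] < n - 1)
...               .map(lambda pair: pair[0]))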

teedak8s