Make The 3rd row as Header of Dataframe

Question

I have data in a CSV file as shown below. The first row is blank, and the second row is filled in for only 4 columns:

        201901      201902  201903  201904
A   X   1           0       1       1
B   Y   0           0       1       1
A   Z   1           0       1       1
B   X   1           0       1       1
A   Y   0           0       0       1
B   Z   1           0       0       1
A   X   0           1       0       1
B   Y   1           1       0       0
A   Z   1           1       0       0
B   X   0           1       1       0

If I read the CSV into a DataFrame, I get the data below:

_c1     _c2     _c3         _c4     _c5     _c6
null    null    null        null    null    null
null    null    201901      201902  201903  201904
A       X       1           0       1       1
B       Y       0           0       1       1
A       Z       1           0       1       1
B       X       1           0       1       1
A       Y       0           0       0       1
B       Z       1           0       0       1
A       X       0           1       0       1
B       Y       1           1       0       0
A       Z       1           1       0       0
B       X       0           1       1       0

I have read the data file without a header and removed the rows that are not required. Now I want the DataFrame to have a header, as in the expected output shown after the code:

from pyspark.sql.functions import col

df = spark.read.csv("s3://abc/def/file.csv", header=False)
df = df.where(col("_c3").isNotNull())  # keep only rows where _c3 is not null

Type    Source  201901      201902  201903  201904
A       X       1           0       1       1
B       Y       0           0       1       1
A       Z       1           0       1       1
B       X       1           0       1       1
A       Y       0           0       0       1
B       Z       1           0       0       1
A       X       0           1       0       1
B       Y       1           1       0       0
A       Z       1           1       0       0
B       X       0           1       1       0

What is your effort on it? Please show what you tried and how it's not working... SO is not a code-writing service for projects. Please add your code snippets and test data. — Ram Ghadiyaram, May 28 '19 at 15:10
Hi Ram, the test data is provided above. As shown there, I am trying to get ideas on how to make that row the header, but I am not getting anywhere. Since I could not come up with an idea, I asked here. Apologies for not being clear. — Kumar P, May 28 '19 at 15:15
@KumarP Your question is being downvoted because it is either considered poorly formatted or generally unhelpful to the rest of the community. I recommend reading through the posts in [Help Center](https://stackoverflow.com/help), particularly [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) topic. — Matthew, Jun 06 '19 at 18:20
https://stackoverflow.com/questions/27772805/need-a-regex-to-remove-everything-except-numbers -- Kindly look at this question. It's not formatted and the answer can be found on Google. Still, it's upvoted. A question might be easy or not useful to people who already know the answer, but it's useful to the people who ask. So my kind request: don't downvote to show that you have high reputation. — Kumar P, Jun 08 '19 at 17:39
When you say it's not helpful to the community, it means the asker is not considered part of the community. It's a properly formatted question with an example and it's still downvoted, which I feel is wrong. — Kumar P, Jun 08 '19 at 17:40

score -2 · Answer 1 · answered May 28 '19 at 23:35

You can create a custom schema by defining it something like this (Scala):

import org.apache.spark.sql.types._

val customSchema = StructType(Array(
    StructField("yourcolumnheader", StringType, true),
    StructField("yourcolumnheader2", StringType, true),
    StructField("yourcolumnheader3", IntegerType, true),
    StructField("yourcolumnheader4", DoubleType, true)))

Then use that schema when you read your formatted CSV file, i.e. the one with the 3 rows removed:

val df = spark.read
    .schema(customSchema)          // set the schema on the reader, before loading
    .option("header", "false")
    .csv("s3://abc/def/file.csv")

Hope that answers your question.
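
If editing the file first is not an option, the column names can also be taken from the first remaining row of the filtered DataFrame in the question. The following is only a minimal PySpark sketch under stated assumptions: `header_names` and `promote_first_row` are hypothetical helper names, `df.first()` is assumed to return the header row (Spark does not strictly guarantee row order), and "Type" and "Source" are the names the question wants for the two blank header cells.

```python
def header_names(first_row, fallback_names):
    """Build column names from a header row, filling blank cells in order.

    fallback_names supplies names for cells that are None (hypothetical
    helper, not part of PySpark).
    """
    fallbacks = iter(fallback_names)
    return [str(cell) if cell is not None else next(fallbacks)
            for cell in first_row]


def promote_first_row(df, fallback_names):
    """Rename df's columns from its first row, then drop that row.

    Assumes df is a PySpark DataFrame whose first row is the header row.
    """
    first = df.first()
    names = header_names(list(first), fallback_names)
    # Assumption: the last column is non-null in the header row and no data
    # row carries the same value there, so this comparison removes only it.
    key = df.columns[-1]
    return df.where(df[key] != first[key]).toDF(*names)


# Pure-Python check of the naming step, using the header row from the question.
print(header_names([None, None, "201901", "201902", "201903", "201904"],
                   ["Type", "Source"]))
```

With the question's code, this would be called as `df = promote_first_row(df, ["Type", "Source"])` right after the `where` filter. All columns stay strings because the file was read without a schema, so cast them afterwards if numeric types are needed.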
