How to add column names automatically when converting an RDD to Rows?

Asked Aug 30 '21 at 14:12

Active Aug 30 '21 at 14:12

Viewed 104 times

https://spark.apache.org/docs/latest/sql-getting-started.html#interoperating-with-rdds

from pyspark.sql import Row

# `sc` is an existing SparkContext.
# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

This is the example from the website, but if I have thousands of columns, do I need to add each name manually, one by one, like this?

airports_rdd_row = parts.map(lambda p: Row(IATA_CODE=p[0], 
                                            AIRPORT=p[1],
                                            CITY=p[2],
                                            STATE=p[3],
                                            COUNTRY=p[4],
                                            LATITUDE=p[5],
                                            LONGITUDE=p[6]
                                          ))

asked Aug 30 '21 at 14:12

Trinidad

How does your file look? What do you want to achieve? – Robert Kossendey Aug 30 '21 at 14:18
I just don't want to set the names manually using p[0], p[1]... The first row of the file contains the column names. – Trinidad Aug 30 '21 at 14:42
What does your original file `people.txt` look like? Could you load the file as [csv](https://stackoverflow.com/a/29705881/2129801)? – werner Aug 30 '21 at 18:53
The file is a CSV file with a header in the first line. people.txt is just an example from the official documentation. – Trinidad Aug 31 '21 at 07:37
Why don't you use a DataFrame (`spark.read.csv('people.txt')`)? – werner Sep 06 '21 at 15:03
Because I want to compare the RDD with the DataFrame. – Trinidad Sep 10 '21 at 08:01
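
Given what the comments establish (a CSV file whose first line is the header, and a wish to stay with RDDs), one possible approach is to read the column names from that first line and build the Rows from them instead of typing each name. The following is only a sketch under some assumptions: `parse_line` and `rows_from_lines` are hypothetical helper names, the file name in the usage comment is made up, the file has no blank lines, and every value is left as a string (any `int(...)`/`float(...)` conversion would still have to be added per column).

```python
import csv


def parse_line(line):
    """Split one CSV line into fields, honouring quoted commas."""
    return next(csv.reader([line]))


def rows_from_lines(lines):
    """Turn an RDD of CSV lines (header first) into an RDD of Rows."""
    # Imported here so parse_line stays usable without a Spark installation.
    from pyspark.sql import Row

    header = parse_line(lines.first())   # column names from the first line
    make_row = Row(*header)              # Row "template" carrying those names
    data = (lines.zipWithIndex()         # pair each line with its position
                 .filter(lambda pair: pair[1] > 0)  # drop line 0, the header
                 .keys())
    return data.map(lambda line: make_row(*parse_line(line)))


# Usage, assuming an existing SparkContext `sc` and a hypothetical file name:
# airports_rdd_row = rows_from_lines(sc.textFile("airports.csv"))
```

Building the template with `Row(*header)` and then calling it with the values keeps the columns in file order and avoids passing thousands of keyword arguments by hand.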

0 Answers