I'm fairly new to PySpark and I'm trying to work with interval data from an electric meter dataset (CSV) that I have access to.
The DataFrame schema created from the CSV import looks something like this:
root
|-- _c0: string (nullable = true)
|-- _c1: integer (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
|-- _c4: integer (nullable = true)
|-- _c5: string (nullable = true)
|-- _c6: long (nullable = true)
|-- _c7: string (nullable = true)
|-- _c8: string (nullable = true)
|-- _c9: string (nullable = true)
|-- _c10: string (nullable = true)
|-- _c11: double (nullable = true)
|-- _c12: integer (nullable = true)
|-- _c13: integer (nullable = true)
|-- _c14: long (nullable = true)
|-- _c15: string (nullable = true)
|-- _c16: double (nullable = true)
|-- _c17: long (nullable = true)
|-- _c18: string (nullable = true)
|-- _c19: double (nullable = true)
|-- _c20: long (nullable = true)
|-- _c21: string (nullable = true)
|-- _c22: double (nullable = true)
|-- _c23: long (nullable = true)
|-- _c24: string (nullable = true)
|-- _c25: double (nullable = true)
|-- _c26: long (nullable = true)
|-- _c27: string (nullable = true)
|-- _c28: double (nullable = true)
...
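For reference, I'm loading the file roughly like this (the path is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("meter-intervals").getOrCreate()

# No header row in the file, so Spark assigns _c0, _c1, ... and infers the types shown above
df = spark.read.csv("s3://my-bucket/meter_intervals.csv", header=False, inferSchema=True)
df.printSchema()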
_c13 contains a number indicating how many groupings of 3 data columns (a timestamp, a flag, and a value) follow it; the count varies from 1 to 96 per record (the index arithmetic is sketched after the examples below). For example:
_c14, _c15, _c16 = Group 1 (202010101315, "NONE", 1.1)
_c17, _c18, _c19 = Group 2 (202010101330, "NONE", 1.2)
_c20, _c21, _c22 = Group 3 (202010101345, "EST", 0.75) etc...
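Put another way, the index arithmetic for the layout is this (group_columns is just a made-up helper to describe it):

# For group i (1-based): timestamp, flag, value sit at three consecutive positions
def group_columns(i):
    base = 14 + 3 * (i - 1)   # group 1 starts at _c14
    return (f"_c{base}", f"_c{base + 1}", f"_c{base + 2}")

So group_columns(3) gives ("_c20", "_c21", "_c22").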
The final output I believe I want for an AWS Redshift table is a single row per grouping, with the fixed columns _c0 to _c12 from the source DataFrame repeated alongside each grouping:
+-----+-----+-----+------+--------------+------+-------+
| _c0 | _c1 | ... | _c12 | timestamp    | flag | value |
+-----+-----+-----+------+--------------+------+-------+
| A   | B   | ... | L    | 202010101315 | NONE | 1.1   |
| A   | B   | ... | L    | 202010101330 | NONE | 1.2   |
| A   | B   | ... | L    | 202010101345 | EST  | 0.75  |
+-----+-----+-----+------+--------------+------+-------+
etc...
So far, I have managed to load the data into a DataFrame. To iterate over each row, I realized I can create an RDD by mapping a custom function over each row:
rdd = df.rdd.map(customFunction)
but I quickly realized that map() can only return a single grouping per input row.
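To illustrate, the kind of function I was writing looks roughly like this (simplified, column positions hard-coded; it can only emit the first grouping):

from pyspark.sql import Row

def customFunction(row):
    # Keep the fixed columns _c0.._c12
    fixed = {f"_c{i}": row[i] for i in range(13)}
    # map() produces exactly one output record per input record,
    # so I can only attach the FIRST grouping here; the rest are lost
    return Row(**fixed, timestamp=row[14], flag=row[15], value=row[16])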
Then I looked at appending a row to a new DataFrame from within customFunction, but after reading that DataFrames are immutable and every append returns a new DataFrame, I realized that is probably not very efficient.
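(i.e. something like the following once per grouping, where out_df and new_row are just placeholders; it seems like it would rebuild the DataFrame every time):

# Hypothetical per-grouping append: union() returns a brand-new DataFrame each call,
# so doing this once per grouping looks like it would be very slow
out_df = out_df.union(spark.createDataFrame([new_row], schema=out_df.schema))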
Any help with a basic structure to achieve this record split efficiently would be appreciated!