I have a CSV file that I want to read into an RDD or DataFrame. This works so far, but when I collect the data and convert it into a pandas DataFrame for plotting, the table is "malformed".
Here is how I read the CSV file:
import csv
import os
NUMERIC_DATA_FILE = os.path.join(DATA_DIR, "train_numeric.csv")
numeric_rdd = sc.textFile(NUMERIC_DATA_FILE)
numeric_rdd = numeric_rdd.mapPartitions(lambda x: csv.reader(x, delimiter=","))
numeric_df = sqlContext.createDataFrame(numeric_rdd)
numeric_df.registerTempTable("numeric")
The result looks like this:
Is there an easy way to use the first row of the CSV data as the column names and the first column as the index?
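For reference, one possible approach is sketched below. It assumes the first line of the file is the header and that no data row is identical to it. The helper names `parse_line` and `load_with_header` are made up for illustration; `sc` and `sqlContext` are the existing contexts from the snippet above. Passing a list of names as the second argument of `createDataFrame` labels the columns, although every value stays a string.

```python
import csv


def parse_line(line):
    # Parse one CSV line into a list of string fields.
    return next(csv.reader([line], delimiter=","))


def load_with_header(sc, sqlContext, path):
    # sc / sqlContext: the SparkContext and SQLContext already in use.
    lines = sc.textFile(path)
    header_line = lines.first()          # assumed to hold the column names
    columns = parse_line(header_line)
    rows = (lines
            .filter(lambda l: l != header_line)  # drop the header row
            .map(parse_line))
    # A list of column names is accepted as the schema; types are not inferred
    # beyond what Spark sees in the rows (here: strings).
    return sqlContext.createDataFrame(rows, columns)
```

Spark DataFrames have no index, so the "first column as index" part would probably have to happen on the pandas side after collecting, for example `numeric_df.toPandas().set_index("Id")`, assuming the first column is really called `Id`.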
The problem gets worse when I try to select data from the DataFrame:
numeric_df.select("SELECT Id FROM numeric")
which gives me:
AnalysisException: u"cannot resolve 'SELECT Id FROM numeric' given input columns _799, _640, _963, _70, _364, _143, _167,
_156, _553, _835, _780, _235, ...
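As far as I understand, `select` expects column names or column expressions rather than a full SQL statement, while SQL strings go through `sqlContext.sql` against the registered temp table. A minimal sketch of both forms follows. It assumes the DataFrame actually has a column named `Id`, which is not the case yet given the `_1`, `_2`, ... names in the error; `select_id` is a hypothetical helper.

```python
def select_id(sqlContext, numeric_df):
    # DataFrame API: pass the column name, not a SQL statement.
    by_method = numeric_df.select("Id")
    # SQL API: run the statement against the temp table registered as "numeric".
    by_sql = sqlContext.sql("SELECT Id FROM numeric")
    return by_method, by_sql
```

Both calls should return a new DataFrame with the single column; nothing is computed until an action such as `show()` or `collect()` is called.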