Reading a file through sqlContext.read.csv() works well, since it has plenty of built-in options you can pass to control how the file is parsed. But Spark versions prior to 1.6 may not have this available.
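For reference, the reader route looks roughly like this (a sketch assuming Spark 2.x, where csv() is part of DataFrameReader; on 1.x you would typically go through the external spark-csv package instead, and the header option here is just an assumption about your file):

val df = sqlContext.read
  .option("header", "true")   // assuming the first line carries column names
  .csv("file:///file-path/fileName")

If you are stuck on an older version, SparkContext's textFile method works everywhere: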
val a = sc.textFile("file:///file-path/fileName")
This gives you an RDD[String]. So you now have the RDD, and you want to convert it into a DataFrame.
Now go ahead and define the schema for your RDD using StructType. It can hold as many StructFields as you need.
import org.apache.spark.sql.types._

val schema = StructType(Array(StructField("fieldName1", fieldType, nullable),
                              StructField("fieldName2", fieldType, nullable),
                              StructField("fieldName3", fieldType, nullable),
                              ................
                              ))
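To make that concrete, here is what the schema might look like for a hypothetical three-column file with id, name and age columns (the column names and types are assumptions for illustration, reusing the import above):

val schema = StructType(Array(
  StructField("id", IntegerType, false),
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)
))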
You now have two things: 1) the RDD, which we created using the textFile method;
2) the schema, with the required number of attributes.
The next step is to map this schema onto your RDD.
Notice that each element of your RDD is one whole line, i.e. you have an RDD[String]. What you actually want is to break every line into the individual fields for which you created the schema. So split your RDD on the comma; the following map operation does exactly that.
val b = a.map(x => x.split(","))
On evaluation you get an RDD[Array[String]].
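One caveat worth noting: String.split drops trailing empty strings by default, so a line ending in empty columns would come back with fewer fields than the schema expects. If your data can have trailing empty fields, pass a negative limit to keep them:

val b = a.map(x => x.split(",", -1))   // -1 keeps trailing empty fields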
But you may say that an Array[String] is still not something you can directly build a DataFrame from.
This is where the Row API comes to the rescue. Import it with import org.apache.spark.sql.Row, and map each split array into a Row object. See this:
import org.apache.spark.sql.Row
val c = b.map(x => Row(x(0), x(1), ..., x(n)))
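Keep in mind that split gives you Strings. If a field in your schema is not a StringType, convert the value while building the Row, otherwise the DataFrame will fail at runtime once the data is evaluated. With the hypothetical id/name/age schema from above, that would look like:

val c = b.map(x => Row(x(0).trim.toInt, x(1), x(2).trim.toInt))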
The above expression gives you an RDD where every element is a Row. All it needs now is the schema, and sqlContext's createDataFrame method does exactly that for you.
val myDataFrame = sqlContext.createDataFrame(c, schema)
This method takes two parameters: 1) the RDD you want to work on;
2) the schema you want to apply on top of it.
The result of the evaluation is a DataFrame object.
So we finally have our DataFrame object myDataFrame. Use the show method on myDataFrame and you get to see the data in tabular format.
You are now good to perform any Spark SQL operation on it.
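For example (a sketch; the table name and query are made up, and registerTempTable is the 1.x-era API, replaced by createOrReplaceTempView in Spark 2.x):

myDataFrame.show()

// expose the DataFrame to SQL and run a query against it
myDataFrame.registerTempTable("my_table")
sqlContext.sql("SELECT fieldName1, fieldName2 FROM my_table").show()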