
I don't want to use the inferSchema and header options. Instead, I should read a file containing only the column headers and use it dynamically to create a dataframe.

I am using Spark 2 to load a single csv file with my user-defined schema, but I want to handle this dynamically: once I provide the path of the schema file, it should be read and used as the headers for the data, converting the data to a dataframe with the schema provided in the schema file.

Suppose the folder I am provided contains 2 files. One file will have only the data; a header is not compulsory. The 2nd file will have the schema (column names). So I have to read the schema file first, followed by the file containing the data, apply the schema to the data file, and show it as a dataframe.

Small example, schema.txt contains:

Custid,Name,Product

while the data file contains:

1,Ravi,Mobile
  • That should be possible to do. Where exactly are you having trouble implementing this? – Shaido Jun 27 '19 at 02:11
  • @Shaido After reading the schema file I want to apply it directly to the data, so that even if the schema changes we can just pass the schema file as an argument during spark-submit. – Allforone Jun 27 '19 at 03:04
  • If possible, can you share a sample code? – Allforone Jun 27 '19 at 03:04
  • Can you add an example of what this schema file looks like? A csv with the column names as a header? What about column types (supplied by the schema or inferred when reading)? – Shaido Jun 27 '19 at 03:06
  • @Shaido Yes, suppose I am provided with a folder containing 2 files. One file will have only the data (a header is not compulsory). The 2nd file will have the schema (column names). So I have to read the schema file first, followed by the file containing the data, apply the schema to the data file, and show it as a dataframe. – Allforone Jun 27 '19 at 03:10
  • Please add a small example of what the schema file can look like to the question. – Shaido Jun 27 '19 at 03:22
  • @Shaido Suppose we have schema.txt (or any format) containing Custid,Name,Product, and a data file with 1,Ravi,Mobile. I have to read the schema.txt file and then populate it with the data file. Note: ideally we can read the schema file and pass it during spark-submit as an argument. – Allforone Jun 27 '19 at 03:29
  • I added an answer below that I think should answer your question. It should be possible to pass the file with spark-submit as an argument or read it directly from the file system. – Shaido Jun 27 '19 at 03:46

1 Answer


From your comments I'm assuming the schema file only contains the column names and is formatted like a csv file (with the column names as a header and without any data rows). The column types will be inferred from the actual data file and are not specified by the schema file.

In this case, the easiest solution would be to read the schema file as a csv, setting header to true. This gives an empty dataframe but with the correct header. Then read the data file and change the default column names to the ones in the schema dataframe.

// Paths to the schema file and the data file (placeholders kept from the question)
val schemaFile = ...
val dataFile = ...

// Reading the schema file as a csv with a header gives an empty
// dataframe whose column names come from the header line.
val colNames = spark.read.option("header", "true").csv(schemaFile).columns

// Read the data file without a header, let Spark infer the column types,
// and rename the default columns (_c0, _c1, ...) to the schema file's names.
val df = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv(dataFile)
  .toDF(colNames: _*)
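
As noted in the comments, the two paths do not have to be hard-coded; they can be supplied as arguments to spark-submit. A minimal sketch, assuming the schema path is the first argument and the data path the second (the object name DynamicSchemaLoader is made up for illustration):

import org.apache.spark.sql.SparkSession

object DynamicSchemaLoader {
  def main(args: Array[String]): Unit = {
    // Assumption: args(0) is the schema file path, args(1) is the data file path.
    val schemaFile = args(0)
    val dataFile = args(1)

    val spark = SparkSession.builder().getOrCreate()

    // Same approach as above: column names from the schema file,
    // column types inferred from the data file.
    val colNames = spark.read.option("header", "true").csv(schemaFile).columns
    val df = spark.read
      .option("header", "false")
      .option("inferSchema", "true")
      .csv(dataFile)
      .toDF(colNames: _*)

    df.show()
  }
}

It could then be launched with something like spark-submit --class DynamicSchemaLoader app.jar /path/to/schema.txt /path/to/data.txt, so pointing at a different folder only requires different arguments, not a code change.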
  • @Shaido Thanks for the suggestion, but I don't want to use the header option, as I have already mentioned in the post. If a file does not contain a header then we should apply the schema to it, and if the data contains a header then we have to check it and apply only the schema provided in the schema.txt file. If possible, we could read the schema file as an argument when submitting a job, and it should automatically pick up the schema.txt file and apply it to the data. – Allforone Jun 27 '19 at 03:46
  • @DeepakPanigrahi: It is necessary to know if the file has a header or not before reading it; there is no way to check this. Even if the file does not have an actual header and you set `header=true`, the first line will be made the header. Spark has no way of knowing if the first line is header or data. – Shaido Jun 27 '19 at 03:48
  • So even if a header is not there, we have to enforce the schema from the schema.txt file. And if we have different files in different locations, how can we make it dynamic? Each folder will comprise a data.txt and a schema.txt, and a 2nd folder may have a different schema and different data. How can this be achieved in one program, so that if the path changes it takes the schema file from that path and applies it to the data? Any idea? – Allforone Jun 27 '19 at 04:00
  • @DeepakPanigrahi: The solution in this answer assumes all data files do **not** have a header. It will read a schema file with column names, read the data (again, without header) and then apply the column names in the schema file as dataframe column names. As you can see in the answer, both the `schemaFile` and `dataFile` should be specified - these two should in your case be the data file and schema file from the specific path. When the path changes, you just need to rerun the code to load the new data (with the new schema). – Shaido Jun 27 '19 at 05:24
  • @Allforone Are these actual requirements for professional software development? What you are asking here is to write code that will figure out a future event and execute accordingly, prior to its execution. Let me know if I got the requirements right on this one. – Aaron Jun 27 '19 at 16:53
  • @Shaido Thanks for the suggestion. It is applying the schema dynamically, but the data types show as string by default. I have used a case for converting the data types but no luck. Any suggestions on how to change the data types dynamically without using withColumn in Spark? – Allforone Jun 28 '19 at 04:31
  • @Aaron Yes, you are correct. When we are fetching data from a database there may be cases where the schema is not supported in Spark and results in an ambiguous data type, so while loading, it should validate the schema dynamically and convert it as per the database standards. But now the problem is that even though we are imposing the schema dynamically, the data types show as string by default, even when a column is of type int. – Allforone Jun 28 '19 at 04:52
  • @Allforone: The data types should be inferred from the values in each respective column. If the data file has a header, then all columns will be strings in this case (since the header is ignored, and if a column has mostly numbers but some strings, the type will be string). You can add `.option("inferSchema", "true")` when reading the data file to be sure the types are automatically inferred (added in the answer above). – Shaido Jun 28 '19 at 04:55
  • @Shaido Is there any way to cast them without using inferSchema? For example, my schema file contains PID:Int,Name:String and my data file contains 1,Phone. How can I apply those dynamically, so that 1 is cast to int without using inferSchema? – Allforone Jun 28 '19 at 05:01
  • @Allforone: It should be possible by building an actual schema that can be applied when reading the data file, see e.g.: https://stackoverflow.com/questions/39926411/provide-schema-while-reading-csv-file-as-a-dataframe. Creating the schema dynamically from a file shouldn't be too hard (maybe you can consider a json file?). You can try it out and if you have problems you can create a new question and link to it here (since it's quite different from the problem in this question). – Shaido Jun 28 '19 at 05:10
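
Following the last comment, here is a minimal sketch of building an actual schema from a schema file whose single line looks like PID:Int,Name:String, so that 1 is read as an integer without inferSchema. The helper name parseSchema and the type-name mapping are assumptions for illustration, not part of the original answer or the linked question:

import org.apache.spark.sql.types._

// Hypothetical helper: turn "PID:Int,Name:String" into a StructType.
def parseSchema(line: String): StructType = {
  val fields = line.split(",").map { col =>
    val Array(name, tpe) = col.split(":").map(_.trim)
    val dataType = tpe.toLowerCase match {
      case "int" | "integer" => IntegerType
      case "long"            => LongType
      case "double"          => DoubleType
      case _                 => StringType // fall back to string for unknown types
    }
    StructField(name, dataType, nullable = true)
  }
  StructType(fields)
}

// Read the single line of the schema file and apply the resulting schema
// when reading the data file (no header, no inferSchema needed).
val schemaLine = spark.read.textFile(schemaFile).first()
val schema = parseSchema(schemaLine)
val df = spark.read.schema(schema).csv(dataFile)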