
I wanted to take something like this https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and create a Hive UDAF - an aggregate function that returns a data type guess.

Does Spark have something like this already built-in? It would be very useful for exploring new, wide datasets. It would also be helpful for ML, e.g. to decide between categorical and numerical variables.

How do you normally determine data types in Spark?

P.S. Frameworks like h2o automatically determine data types by scanning a sample of the data, or the whole dataset. One can then decide, e.g., whether a variable should be categorical or numerical.

P.P.S. Another use case is when you get an arbitrary data set (we get them quite often) and want to save it as a Parquet table. Providing correct data types makes Parquet more space-efficient (and probably more performant at query time, e.g. better Parquet bloom filters than just storing everything as string/varchar).
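
A rough PySpark sketch of what I have in mind - one aggregation pass that checks, per column, whether every non-null value survives a cast to a narrower type (the path and the type ladder below are hypothetical, just to illustrate the idea):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def guess_types(df):
    """Guess a type per column of an all-string DataFrame in one pass."""
    aggs = []
    for c in df.columns:
        col = F.col(c)
        aggs += [
            F.count(col).alias(c + "__non_null"),
            # count(when(...)) counts only the rows where the cast succeeds
            F.count(F.when(col.cast("long").isNotNull(), 1)).alias(c + "__long"),
            F.count(F.when(col.cast("double").isNotNull(), 1)).alias(c + "__double"),
            F.count(F.when(F.to_date(col).isNotNull(), 1)).alias(c + "__date"),
        ]
    row = df.agg(*aggs).first()   # single scan over the data

    guesses = {}
    for c in df.columns:
        n = row[c + "__non_null"]
        if n == 0:
            guesses[c] = "string"          # all null, nothing to go on
        elif row[c + "__long"] == n:
            guesses[c] = "bigint"
        elif row[c + "__double"] == n:
            guesses[c] = "double"
        elif row[c + "__date"] == n:
            guesses[c] = "date"
        else:
            guesses[c] = "string"
    return guesses

# df = spark.read.parquet("/data/raw_strings")   # hypothetical all-string table
# print(guess_types(df))
```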

Tagar
  • That `UDF` doesn't aggregate anything. – o-90 Sep 22 '15 at 19:31
  • That was just an idea of how the data-type-guessing logic of such a UDAF might work. It's true that it is not a UDAF. Thanks. – Tagar Sep 22 '15 at 21:13
  • I guess what I'm saying is why do you need to aggregate the data to determine what the type of the variable is? Just do what is in that java UDF, but write it in scala. – o-90 Sep 22 '15 at 21:17
  • That's a valid and open question. I think the UDAF way may look more elegant. And essentially data type guessing is an aggregate function - it returns one value for the set of (all) values of a specific table column. – Tagar Sep 22 '15 at 22:11

1 Answer


Does Spark have something like this already built-in?

Partially. There are some tools in the Spark ecosystem which perform schema inference, like spark-csv or pyspark-csv, and category inference (categorical vs. numerical), like VectorIndexer.
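
For illustration, a minimal Spark 1.x-style sketch of both tools (paths, column names, and the spark-csv package coordinates are hypothetical examples):

```python
# Run with e.g.: spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 script.py
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import VectorAssembler, VectorIndexer

sc = SparkContext(appName="schema-inference-sketch")
sqlContext = SQLContext(sc)

# Schema inference with the external spark-csv package:
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")   # extra pass over the data to guess types
      .load("/data/sample.csv"))       # hypothetical path
df.printSchema()

# Category inference with VectorIndexer: features with at most
# maxCategories distinct values are treated as categorical.
assembled = VectorAssembler(
    inputCols=["x1", "x2", "x3"],      # assume these were inferred as numeric
    outputCol="features").transform(df)
indexer = VectorIndexer(inputCol="features", outputCol="indexed",
                        maxCategories=10)
model = indexer.fit(assembled)
model.transform(assembled).select("features", "indexed").show(5)
```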

So far so good. The problem is that schema inference has limited applicability, is not an easy task in general, can introduce hard-to-diagnose problems, and can be quite expensive:

  1. There are not that many formats which can be used with Spark and require schema inference. In practice it is limited to different variants of CSV and fixed-width formatted data.
  2. Depending on the data representation, it can be impossible to determine the correct data type, or the inferred type can lead to information loss (see the small example after this list):

    • interpreting numeric data as float or double can lead to an unacceptable loss of precision, especially when working with financial data
    • date or number formats can differ based on locale
    • some common identifiers can look numeric while having an internal structure which can be lost in conversion
  3. Automatic schema inference can mask different problems with input data, and if it is not supported by additional tools which can highlight possible issues, it can be dangerous. Moreover, any mistakes during data loading and cleaning can be propagated through the complete data processing pipeline.

    Arguably, we should develop a good understanding of the input data before we even start to think about possible representation and encoding.

  4. Schema inference and / or category inference may require a full data scan and / or large lookup tables. Both can be expensive, or even not feasible, on large datasets.
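
A tiny illustration of the second point, using made-up values and explicit casts to mimic what an inference step would do:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical values: a ZIP-like identifier and a high-precision amount.
df = spark.createDataFrame([("02134", "0.10000000000000000055")],
                           ["zip_code", "amount"])

df.select(
    F.col("zip_code").cast("int").alias("zip_as_int"),        # leading zero lost: 2134
    F.col("amount").cast("float").alias("amount_as_float"),   # precision lost: 0.1
).show()
```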

Edit:

It looks like schema inference capabilities for CSV files have been added directly to Spark SQL. See CSVInferSchema.
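
A minimal example of the built-in inference (Spark 2.0+, hypothetical path):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# inferSchema triggers CSVInferSchema-based type guessing in Spark SQL itself.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/sample.csv"))
df.printSchema()
```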

zero323
  • thank you @zero323. The reference to spark-csv / pyspark-csv was most useful. That seems like it might be it. Although we have a lot of use cases for pyspark-csv where the source files are Parquet files. Would it be possible for pyspark-csv to add support for a Parquet table or an arbitrary data frame? I just created an issue https://github.com/seahboonsiew/pyspark-csv/issues/8 If this is possible, I will accept this as the correct answer. Thank you again. – Tagar Feb 13 '16 at 16:07
  • Do you mean you want to read Parquet with strings and use the same logic as `pyspark-csv` to convert columns to different types? – zero323 Feb 13 '16 at 19:54
  • Yep, exactly. We have source parquet tables with hundreds of columns (some up to 800) and it would be great to use something like pyspark-csv to convert them to a dataframe with correct data types. It is okay to read the whole data set (hundreds of millions of records) to make a correct data-type "guess". thank you. – Tagar Feb 13 '16 at 19:58
  • Take a look at this: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala although it is not in a release yet. – zero323 Feb 15 '16 at 03:05
  • Thank you. It looks to me like they will merge spark-csv into Spark trunk? If that's the case, then https://github.com/databricks/spark-csv/issues/264 is still relevant. – Tagar Feb 15 '16 at 17:18
  • Looks like it. I haven't searched for the corresponding JIRA though. – zero323 Feb 15 '16 at 22:33
  • It seems possible to do with spark-csv, per HyukjinKwon in https://github.com/databricks/spark-csv/issues/264, by converting the dataframe to CSV and then using spark-csv to go back to a dataframe (ugh). I will probably have to write my own pyspark code. I will mark this as the correct answer as there are no other answers and there is a workaround to make spark-csv work. Thank you. – Tagar Feb 18 '16 at 18:02
  • @Tagar can you share the pyspark code which you wrote? – pratiklodha Feb 03 '18 at 15:52
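
For completeness, a minimal sketch of the round-trip workaround discussed in the comments above, assuming the Spark 2.x built-in CSV reader rather than the external spark-csv package (paths are hypothetical; this is only a sketch, not the actual code mentioned in the comments):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# All-string Parquet table with many columns (hypothetical path).
strings_df = spark.read.parquet("/data/strings_only")

# Dump to CSV text, then let the CSV reader's inferSchema do the type guessing.
strings_df.write.mode("overwrite").option("header", "true").csv("/tmp/roundtrip")

typed_df = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("/tmp/roundtrip"))
typed_df.printSchema()

# Re-save with proper types for a more compact Parquet table.
typed_df.write.mode("overwrite").parquet("/data/typed")
```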