3

PMML, MLeap, and PFA currently support only row-based transformations. None of them supports frame-based transformations such as aggregates, group-by, or joins. What is the recommended way to export a Spark pipeline that contains these operations?

Gowrav

2 Answers

0

I see two options with MLeap:

1) Implement DataFrame-based transformers and an MLeap equivalent of SQLTransformer. This seems conceptually the best solution (since you can always encapsulate such transformations in a pipeline element), but it is also a lot of work. See https://github.com/combust/mleap/issues/126

2) Extend DefaultMleapFrame with the operations you want to perform, and then apply them to the data handed to the REST server within a modified MleapServing subproject.

I actually went with 2) and added implode, explode, and join as methods on DefaultMleapFrame, plus a HashIndexedMleapFrame that allows for fast joins. I did not implement groupby and agg, but in Scala this is relatively easy to accomplish.
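
As a rough illustration of the two ideas mentioned here (a hash-indexed join and a group-by/aggregate over a row-based frame), here is a minimal sketch. It uses plain Python over lists of dicts purely for brevity; it is not MLeap's API and not the Scala implementation described above. All function names and the sample data are hypothetical.

```python
from collections import defaultdict


def build_hash_index(rows, key):
    """Map each value of `key` to the list of rows that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def hash_join(left_rows, right_index, key):
    """Inner join: probe the prebuilt index once per left row."""
    joined = []
    for left in left_rows:
        for right in right_index.get(left[key], []):
            # Left-side values win if both sides share a column name.
            joined.append({**right, **left})
    return joined


def group_by_agg(rows, key, column, agg):
    """Group rows by `key` and reduce `column` within each group using `agg`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    return {group: agg(values) for group, values in groups.items()}


# Hypothetical sample data.
orders = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 1, "amount": 5.0},
    {"user_id": 2, "amount": 7.5},
]
users = [
    {"user_id": 1, "country": "DE"},
    {"user_id": 2, "country": "US"},
]

user_index = build_hash_index(users, "user_id")   # built once, reused per request
joined = hash_join(orders, user_index, "user_id")
totals = group_by_agg(joined, "country", "amount", sum)
```

Building the index once and probing it per incoming row is what makes the join cheap at serving time; the group-by is a single pass that buckets values before reducing them.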

Elmar Macek
-1

PMML and PFA are standards for representing machine learning models, not data processing pipelines. A machine learning model takes in a data record, performs some computation on it, and emits an output data record. So by definition, you are working with a single isolated data record, not a collection/frame/matrix of data records.

If you need to represent complete data processing pipelines (where the ML model is just one part of the workflow), then you need to look at other or combined standards. Perhaps SQL paired with PMML would be a good choice. The idea is that you want to perform data aggregation outside of the ML model, not inside it (e.g., a SQL database will be much better at it than any PMML or PFA runtime).
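
A minimal sketch of that split, assuming an in-memory SQLite database stands in for the SQL layer and a hypothetical `score_record` function stands in for a row-based PMML or PFA scorer (the table, columns, and threshold are made up for illustration):

```python
import sqlite3


def score_record(record):
    """Hypothetical stand-in for a row-based model: one record in, one score out."""
    return 1.0 if record["total_amount"] > 12.0 else 0.0


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access to columns by name
conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# Frame-level step: the database performs the group-by and aggregation.
cursor = conn.execute(
    "SELECT user_id, SUM(amount) AS total_amount, COUNT(*) AS n_orders "
    "FROM orders GROUP BY user_id ORDER BY user_id"
)

# Row-level step: each aggregated record is scored in isolation.
scores = {row["user_id"]: score_record(dict(row)) for row in cursor}
conn.close()
```

The model only ever sees one already-aggregated record at a time, which is the contract a row-based runtime expects.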

user1808924
  • PFA and MLeap are not restricted to machine learning models. As per DMG, PFA is an emerging standard for statistical models and **data transformation** engines. Also, within MLeap development there are discussions about converting the existing row-based transformations to frame-based ones. Refer [here](https://github.com/combust/mleap/issues/126#issuecomment-310137673) – Gowrav Nov 26 '18 at 19:26
  • In this context, "data transformation" means feature engineering, not re-implementing the SQL standard. For example, PMML comes with built-in aggregate functions (http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_Aggregate), but their scope is limited to that one data record (not a database). – user1808924 Nov 26 '18 at 20:53
  • To elaborate: "data transformation" != "data query". – user1808924 Nov 26 '18 at 20:58