PMML, MLeap, and PFA currently support only row-based transformations. None of them supports frame-based transformations such as aggregates, groupby, or join. What is the recommended way to export a Spark pipeline that contains these operations?
2 Answers
I see two options with respect to MLeap:
1) Implement DataFrame-based transformers and an MLeap equivalent of the `SQLTransformer`. This seems conceptually the best solution (since you can always encapsulate such transformations in a pipeline element), but it is also a lot of work. See https://github.com/combust/mleap/issues/126
2) Extend the `DefaultMleapFrame` with the operations you want to perform, then apply the required actions to the data handed to the REST server within a modified `MleapServing` subproject.
I actually went with 2) and added `implode`, `explode`, and `join` as methods to the `DefaultMleapFrame`, and also a `HashIndexedMleapFrame` that allows for fast joins. I did not implement `groupby` and `agg`, but in Scala this is relatively easy to accomplish.
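To make the idea of a hash-indexed join and a group-by aggregation over a row-based frame concrete, here is a minimal sketch in plain Python. It is not MLeap code and does not reflect the actual Scala implementation described above; the helper names `hash_index`, `hash_join`, and `group_agg`, as well as the sample data, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def hash_index(rows, key):
    """Build a lookup from key value to the list of rows holding that value."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def hash_join(left_rows, right_index, key):
    """Inner-join left rows against a prebuilt index on the shared key."""
    joined = []
    for row in left_rows:
        # .get avoids inserting empty entries into the defaultdict
        for match in right_index.get(row[key], []):
            joined.append({**match, **row})
    return joined


def group_agg(rows, key, value, fn):
    """Group rows by `key` and reduce the `value` column of each group with `fn`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: fn(v) for k, v in groups.items()}


# Hypothetical sample data
orders = [
    {"user": "a", "amount": 10.0},
    {"user": "b", "amount": 5.0},
    {"user": "a", "amount": 2.5},
]
users = [
    {"user": "a", "country": "DE"},
    {"user": "b", "country": "US"},
]

user_index = hash_index(users, "user")       # built once, reused for every join
joined = hash_join(orders, user_index, "user")
totals = group_agg(joined, "country", "amount", sum)
```

Building the index once is what makes repeated joins cheap: each incoming row costs one dictionary lookup instead of a scan over the other frame.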

PMML and PFA are standards for representing machine learning models, not data processing pipelines. A machine learning model takes in a data record, performs some computation on it, and emits an output data record. So by definition, you are working with a single isolated data record, not a collection/frame/matrix of data records.
If you need to represent complete data processing pipelines (where the ML model is just one part of the workflow), then you need to look at other standards or a combination of them. Perhaps SQL paired with PMML would be a good choice. The idea is that you want to perform data aggregation outside of the ML model, not inside it (e.g., a SQL database will be much better at it than any PMML or PFA runtime).
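As a rough sketch of that split, the following Python example aggregates in SQL first and then hands each aggregated record to a row-based scorer. It uses the standard-library `sqlite3` module as a stand-in for a real database, and the `score` function is a hypothetical placeholder for whatever PMML, PFA, or MLeap runtime would evaluate the exported model; the table, columns, and threshold are likewise assumptions.

```python
import sqlite3


def score(record):
    """Hypothetical stand-in for a row-based model runtime (PMML, PFA, MLeap).

    It sees exactly one record and knows nothing about the rest of the data.
    """
    return 1 if record["total_amount"] > 10 and record["n_orders"] >= 2 else 0


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be converted to dicts
conn.execute("CREATE TABLE orders (user_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10.0), ("b", 5.0), ("a", 2.5)],
)

# Frame-level work (group by, aggregate) happens in the database ...
features = conn.execute(
    "SELECT user_id, SUM(amount) AS total_amount, COUNT(*) AS n_orders "
    "FROM orders GROUP BY user_id ORDER BY user_id"
).fetchall()

# ... and the model is applied to one aggregated record at a time.
predictions = {row["user_id"]: score(dict(row)) for row in features}
conn.close()
```

The model side stays purely row-based, so it can still be exported with any of the formats discussed, while the query carries the aggregation logic.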

- PFA and MLeap are not restricted to machine learning models. As per DMG, PFA is an emerging standard for statistical models and **data transformation** engines. Also, within MLeap development there is discussion about converting the existing row-based transformations to frame-based ones. Refer [here](https://github.com/combust/mleap/issues/126#issuecomment-310137673) – Gowrav Nov 26 '18 at 19:26
- In this context, "data transformation" means feature engineering, not re-implementing the SQL standard. For example, PMML comes with built-in aggregate functions (http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_Aggregate), but their scope is limited to that one data record (not a database). – user1808924 Nov 26 '18 at 20:53
- To elaborate: "data transformation" != "data query". – user1808924 Nov 26 '18 at 20:58