
I am using the Data Source API to load a DataFrame from a custom database. Our database allows special operations that are not present in core SQL. One example is an optimized DISTINCT operation, which is performed extremely fast.
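For context, my current relation looks roughly like the sketch below (the class names, the "table" option and the schema are placeholders, not my real code):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Placeholder provider -- the real one talks to our custom database.
    class DefaultSource extends RelationProvider {
      override def createRelation(
          sqlContext: SQLContext,
          parameters: Map[String, String]): BaseRelation =
        new MyDatabaseRelation(parameters("table"))(sqlContext)
    }

    class MyDatabaseRelation(table: String)(@transient val sqlContext: SQLContext)
      extends BaseRelation with TableScan {

      // Simplified schema matching the "users" example below.
      override def schema: StructType = StructType(Seq(
        StructField("name", StringType),
        StructField("lastname", StringType)))

      // Fetch rows from the custom database (stubbed out here).
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.parallelize(Seq.empty[Row])
    }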

I want to be able to write a query which has custom column expressions/operations, for example:

select MY_DISTINCT name, lastname from users

The way I see it, it could also be a custom filter:

select name, lastname from users where name%%*1

And %%*1 would be passed to buildScan, which I would handle separately in my database before returning the DataFrame.
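To make it concrete, this is roughly where I expect to intercept it, assuming a PrunedFilteredScan relation (again, the names are placeholders): standard predicates arrive in buildScan as org.apache.spark.sql.sources.Filter objects (EqualTo, GreaterThan, ...), but a custom token like %%*1 has no representation there, and I don't know how to make it reach this point:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    class MyFilteringRelation(table: String)(@transient val sqlContext: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType = StructType(Seq(
        StructField("name", StringType),
        StructField("lastname", StringType)))

      // Only the standard Filter types are pushed down by Spark. I would like
      // something like "name%%*1" to arrive here too, so I can translate it into
      // my database's native syntax before returning the rows.
      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        filters.foreach(f => println(s"pushed-down filter: $f"))
        sqlContext.sparkContext.parallelize(Seq.empty[Row]) // stubbed query
      }
    }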

Is this kind of extension possible in Spark? The only relevant documentation I could find is phatak-dev's GitHub examples, but they are very minimal and don't show the connection to the SQL query.

EDIT: I am looking for a way to add extra expressions and handle them from within the Data Source API.
