Is there a way to read from Glue catalog using structured streaming? When I do something like this:

sparkSession
        .catalog()
        .createTable("test", "s3n://test-bucket/data/")
        .as(Encoders.bean(dataType))
        .writeStream()
        .outputMode(OutputMode.Append())
        .format("parquet")
        .option("path", outputFolder.getRoot().toPath().toString())
        .option("checkpointLocation", checkpointFolder.getRoot().toPath().toString())
        .queryName("test-query")
        .start();

I get the error org.apache.spark.sql.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame;

Update

The code snippet and exception above are not related to the actual question. I'd like to know whether there is a way to use the Glue catalog as a source for Structured Streaming in Spark.

Yuriy Bondaruk
  • Your error seems unrelated to Glue. You're reading from a static bucket and treating it as if it were a stream – OneCricketeer Feb 03 '18 at 17:40
  • Sure. But my goal is to read from the Glue catalog using Structured Streaming, so I started experimenting with this variant. That's why I asked "Is there a way to read from Glue catalog using structured streaming?" – Yuriy Bondaruk Feb 03 '18 at 17:45
  • Glue is not a stream. Kinesis is a stream – OneCricketeer Feb 03 '18 at 17:50
  • S3 is not a stream either, but it can be used as a source for Structured Streaming. Moreover, I'm not talking about Glue but about the Glue Catalog as a warehouse – Yuriy Bondaruk Feb 03 '18 at 17:54
  • I feel like you're looking for this https://stackoverflow.com/questions/31980584/how-to-connect-to-a-hive-metastore-programmatically-in-sparksql – OneCricketeer Feb 03 '18 at 18:01
  • Anyway, I've personally never seen a database used as a source of structured streaming with Spark. S3 works because it's a file system source, which is listed first in the documentation. The only way I imagine this working is to have a CDC process capturing database changes, and creating a stream. – OneCricketeer Feb 03 '18 at 18:06
  • The link you sent is not really what I'm looking for. We have a Glue ETL that writes data to S3 in Glue Catalog. I need to consume this data. Since [glue catalog can be used as a metastore for hive](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html) I can set it up as a data source. But again, I don't know if that is possible for structured streaming. That is why I asked my question. And I don't understand why you would mark it as unclear just because you don't know the answer. – Yuriy Bondaruk Feb 03 '18 at 18:23
  • The fact that you don't know if it's possible makes it unclear. You're under the assumption that a catalog acts like a stream of data to which a Spark dataframe can be continuously applied over a period of time, which is simply not the case – OneCricketeer Feb 03 '18 at 18:33
  • Your question is no different from using a plain Hive metastore... Since it's a drop-in replacement, that's why I linked to it. You point a SQL query at a Hive metastore, not a stream – OneCricketeer Feb 03 '18 at 18:35
  • Yes, I had assumed that the Glue catalog could be used for streaming since it's not Hive. I knew that I could get the data with Hive support, but not in Structured Streaming. That's why my question was about the Glue catalog and not Hive. Your comment about my assumption regarding the Glue catalog actually answers my question – Yuriy Bondaruk Feb 03 '18 at 18:51

1 Answer

The exception should give you all the information you need:

'writeStream' can be called only on streaming Dataset/DataFrame;

Registered tables are not streaming sources, and therefore cannot be treated as streams.

You can implement your own source with support for streaming reads. However, tables have no notification mechanism, so there is no efficient way to do this unless the schema itself provides a means for efficient time-based queries (for example, timestamps or time-based partitioning).
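
To make the time-based partitioning point concrete, here is a minimal sketch in plain Java of the bookkeeping such a custom source would need: remember the newest partition already processed, and on each poll return only the partitions newer than it. This is not Spark or Glue API. The class name PartitionPoller, the poll method, and the dt=YYYY-MM-DD partition naming are all assumptions made for illustration, and how the partition names are listed (from the catalog or from S3) is left out.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of Spark or Glue: tracks the newest partition
// already handed out, so each poll returns only partitions that appeared since.
final class PartitionPoller {
    // Assumed partition naming scheme: dt=YYYY-MM-DD
    private static final String PREFIX = "dt=";

    private LocalDate highWaterMark;

    PartitionPoller(LocalDate startAfter) {
        this.highWaterMark = startAfter;
    }

    // Returns the partitions strictly newer than the high-water mark, oldest
    // first, and advances the mark to the newest partition returned.
    // A name with the prefix but a malformed date throws DateTimeParseException.
    List<String> poll(Collection<String> partitionNames) {
        TreeSet<LocalDate> fresh = new TreeSet<>(); // sorted, duplicates dropped
        for (String name : partitionNames) {
            if (!name.startsWith(PREFIX)) {
                continue; // skip entries that are not date partitions
            }
            LocalDate day = LocalDate.parse(name.substring(PREFIX.length()));
            if (day.isAfter(highWaterMark)) {
                fresh.add(day);
            }
        }
        List<String> result = new ArrayList<>();
        for (LocalDate day : fresh) {
            result.add(PREFIX + day); // LocalDate prints as YYYY-MM-DD
        }
        if (!fresh.isEmpty()) {
            highWaterMark = fresh.last();
        }
        return result;
    }
}
```

Without a time-based key like this, every poll would have to rescan the whole table to find new rows, which is the inefficiency described above. The sketch also assumes a partition is complete once it becomes visible; data added later to an already-returned partition would be missed.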

Since your data lands in an S3 bucket, it makes more sense to use the bucket as a source directly.

  • The reason for the exception is clear. I was wondering if there is a way to use a Glue catalog as a stream source since it's not Hive (however similar). BTW, my code snippet doesn't really fit my question (it's how I started experimenting). Consuming directly from S3 was my alternative option, which I'll probably implement. – Yuriy Bondaruk Feb 03 '18 at 19:07
  • @YuriyBondaruk It is not possible to read (as a stream) from a Hive table either. – Alper t. Turker Feb 03 '18 at 23:04
  • Yep, I know. That's why I asked about the Glue catalog. Otherwise I would use Hive to read the data, since the Glue catalog [can be used as a metastore for hive](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html) – Yuriy Bondaruk Feb 04 '18 at 02:36