1

Suppose I have the following tables, in an Oracle DB

Foo:

+--------+---------+---------+
| id_foo | string1 | string2 |
+--------+---------+---------+
|      1 | foo     | bar     |
|      2 | baz     | bat     |
+--------+---------+---------+

Bar:

+--------+-----------+--------+
| id_bar | id_foo_fk | string |
+--------+-----------+--------+
|      1 |         1 | boo    |
|      2 |         1 | bum    |
+--------+-----------+--------+

When I insert into Foo, by using a Dataset and JDBC, such as

Dataset<Row> fooDataset = //Dataset is initialized
fooDataset.write().mode(SaveMode.Append).jdbc(url, table, properties)

an ID is auto-generated by the database. Now when I need to save Bar, using the same strategy, I want to be able to link it to Foo, via id_foo_fk.

I looked into some possibilities, such as using monotonically_increasing_id() as suggested in this question, but it won't solve the issue, as I need the ID generated by the database. I tried what was suggested in this question, but it leads to the same issue, of unique non-database IDs

It's also not possible to select from the JDBC again, as string1 and string2 may not be unique. Nor is it possible to change the database. For instance, I can't change it to be UUID, and I can't add a trigger for it. It's a legacy database that we can only use

How can I achieve this? Is this possible with Apache Spark?

LeoColman
  • 6,950
  • 7
  • 34
  • 63
  • Do I understand correctly that there is no link between `Foo` and `Bar` that can be inferred from the data alone? – 10465355 Feb 07 '19 at 12:47
  • Correct, there isn't any link other than the foreign key – LeoColman Feb 07 '19 at 12:48
  • So forgive a silly question - but how do you know that records are actually linked in Spark? It means that at some point information in Foo and Bar has to form a single row, right? – 10465355 Feb 07 '19 at 12:50
  • To create a `Bar`, I need a `Foo`, because I need the `Foo_ID`. So when I'm creating `Bar`, I created one instance of `Foo` before it, and this instance contains an autogenerated ID from the database. This ID is needed to create Bar. – LeoColman Feb 07 '19 at 12:52
  • Well, I got this point, but there I feel like there is a missing piece of a puzzle. Let's say I have a tuple `(foo, bar)`. I perform insert, and get back PK `1`. Then I want to perform insert into `Bar`. I have `id_foo_fk`, `id_bar` will be generated by the database, but where `string` (here "foo") comes from? – 10465355 Feb 07 '19 at 12:57
  • From bar itself, when I'm creating bar. It's unrelated to `foo`. Sorry for that confusion, I'll update the question to include that. – LeoColman Feb 07 '19 at 12:58
  • Foo and Bar are only related by the foreign key – LeoColman Feb 07 '19 at 12:59
  • I haven't worked on spark but from jdbc perspective , I think , db trigger is the only way out when `id_foo` is auto generated. Point is if `id_foo` is not collected just after insertion, it will be lost forever. – Sabir Khan Feb 07 '19 at 14:16
  • I'm not a java/jdbc developer, but if you're using an insert statement to store the data in the database, there's an optional returning clause you can use to fetch data back out, e.g. sequence numbers. – Boneist Feb 07 '19 at 15:02
  • I'm not able so far to do that using Spark. That's what I need – LeoColman Feb 07 '19 at 15:27
  • Please add an example of your Spark data. How are the two objects connected in Spark? – Marmite Bomber Feb 07 '19 at 17:04
  • They aren't exactly connected. I'll create a `Foo`, after it I'll create a `Bar`. But to create the `Bar` I'll need `Foo`'s ID, which I planned to add with a `withColumn("foo_id", lit(fooID))` or something similar – LeoColman Feb 07 '19 at 18:12
  • Can you modify your Oracle database or not (a temporary table?)? I was thinking of adding a "temporary" id to Foo from Spark, so you can insert data in Foo, let Oracle generate the right id, then retrieve the data, get the Oracle id, then work on Bar... lengthy, but you need to get what happens in Oracle, so... – jgp Feb 07 '19 at 18:38
  • The problem is that there are no unique keys other than the PK in Oracle. The modelling is pretty bad. I guess this is way too specific problem for StackOverflow. I was hoping that maybe I could use what the JDBC driver returned to me, but I guess that's not possible – LeoColman Feb 07 '19 at 18:43

1 Answers1

0

I'm not a Java specialist so you will have to look into the database layer on how to proceed exactly but there are 3 ways you can do this:

  • You can create a store procedure if the database server you are using is capable of (most do) and call it from your code.
  • Create a trigger that returns the id number on the first insertion and use it in your next DB insertion.
  • Use UUID and use this as the key instead of the database auto generated key.
10465355
  • 4,481
  • 2
  • 20
  • 44
Pat Rick
  • 9
  • 2
  • Unfortunately, it's not possible to change the database (I"ve added this to the question aswell) – LeoColman Feb 07 '19 at 12:38
  • oup sorry did not read until the end then the only option I see is to query the table before inserting the other record in the other table. There is a transaction option with oracle that allows you to wrap all the queries. Use the Max operator to get the latest id inserted. Using the transaction features allows you to commit or roll back if there is any problem. – Pat Rick Feb 07 '19 at 12:40
  • I don't believe this will actually solve my use case, but perhaps it might for future users. – LeoColman Feb 07 '19 at 12:42