Track which files has been read and loaded into db

Question

Use case: After app starts it reads json files maps it and save it as rows in db. In one json files there are multiple data which has to be inserted into db. New file mean new rows have to be added into db - there would not be any data duplication between files.

Problem: When app restarts I don't want to read the files which has been already read. Nevertheless if I add new files and restart app I want to those files to be read.

Question: What can I use to track which files has or has not been read and avoid duplication in db?

Additional info: This is Spring-Boot app. So the question is if there is any maven dependency I can use?

Is it only important that same filenames aren't processed and only new rows are inserted or could it be that data in different files produce the same row data and produce a possible unique/primary key violation? Is it also possible that row data get's updated by the file processing or only new rows? — Nik, May 16 '23 at 11:55
One way to track which files have been read and avoid duplication in the database is to maintain a separate table in the database that stores the names or paths of the files that have been processed. When your application starts, it can query this table to determine which files have already been processed and skip them. — joshii_h, May 16 '23 at 11:59
@joshii_h - this is what I was thinking about, but is there anything available which could makes it automated? — Robert Głowacki, May 16 '23 at 12:34
Perhaps [*Flyway*](https://en.wikipedia.org/wiki/Flyway_(software)) or [*Liquibase*](https://en.m.wikipedia.org/wiki/Liquibase) — Basil Bourque, May 16 '23 at 15:23
@BasilBourque I was looking for some use cases of Liquibase but what I found was only usage of db schema versioning. — Robert Głowacki, May 17 '23 at 06:54

score 2 · Accepted Answer · answered May 16 '23 at 12:00

2

There can be number of options to handle this. One of these could be:

Use pending, in-progress, error, and done - 4 directories.

When App starts, see if in-progress directory has any files. If there are any - that means, the last app-start resulted in an error while processing the files - do something to make it right and move to done OR move to error directory if error persists.

Check in the input directory for new files. Move those to in-progress one by one and import them. After import is a success, move it to the done directory. If import results in an error, and nothing can be done about that file, move to error directory.

The other system which creates the files has to create those in the input directory only. The other 3 directories are for the internal purpose of your app.

Other solution could be DB driven. Have a table in the DB to track the status of each file. Status could be IN_PROGRESS, ERROR and DONE. From your input directory, process only those files that are not there in this DB table. Insert a record in this table with IN_PROGRESS and then import it, and change the status based on the import action's result. Though, in this case, a lot of files will be there in that directory. Over a period of time, the % of already processed files will be huge. So, it will gradually take more time to get the list of files and ignore them, as compared to finding a new file and process it.

And there would be a lot of other ways to handle this, maybe even more efficiently.

answered May 16 '23 at 12:00

Ishan

400
2
8

I think that this DB Driven solution is what I need. According to my use case there will be new files in the future but number wouldn't be huge. Could you provide information is there any solution available for tracking such a files automatically? – Robert Głowacki May 16 '23 at 13:06
1

There is no automatic solution as such. You have to go through each and every file in the directory (`File.listFiles()`) and query `SELECT COUNT(1) FROM tbl WHERE file_name = ?` and see if the file exists already in the table or not, if not, `INSERT INTO tbl (file_name, status) VALUES (?, 'IN_PROGRESS')` and start processing it and then, after the process is complete, `UPDATE tbl SET status = 'DONE' / 'ERROR' WHERE file_name = ?` when done. – Ishan May 16 '23 at 13:10
And by automatic, what exactly do you mean? Does it mean that when a file is created in that directory, your program should be triggered automatically? – Ishan May 16 '23 at 13:13
For automatic trigger of some command upon file creation, check this - https://stackoverflow.com/questions/14692353/executing-a-bash-script-upon-file-creation. In this, you can specify `java your_class` as the command to trigger your program. – Ishan May 16 '23 at 13:15
by automatic I mean that after adding some dependency this code will handle whicih files have already been read, and which not. – Robert Głowacki May 18 '23 at 08:45
1

Ok. So, no automatic trigger when a file is created in the directory. Just the detection of which files are new and which are processed. That will be covered by the DB table I have suggested. – Ishan May 18 '23 at 09:08

score 1 · Answer 2 · answered May 23 '23 at 20:58

Following up on @basil-bourque's suggestion of Perhaps Flyway or Liquibase

Liquibase would work well for this problem. A lot of the examples tend to be around "Tracking schema changes" but it's really best thought of as an engine for ensuring that any database changes are applied consistently and correctly across all your databases. It does that by letting you define a set of change sets which are each uniquely identified, and then tracking each of those separately.

Within those changesets, you can do anything you'd like including reading JSON from files and inserting it into the database. There may not be a built-in function for handling that specific format, like we have for CSV files with insert but it is easy to create your own custom and unique change functions. This page on contribute.liquibase.com is a good starting point for how to do that.

If you write a class like LoadJsonDataChange which has your specific business logic, it will be able to plug into all the existing Liquibase plumbing you will likely need now or later, including:

Tracking which steps have ran and which have not to avoid duplication
Checksum management which detects changes in your config to either re-run or error based on what makes sense for you
Integrations with spring boot and many other systems
Labels, context, and preconditions to dynamically control what is deployed to different databases
And more

score 0 · Answer 3 · answered Jun 19 '23 at 11:56

0

When we run the scripts it also create DATABASECHANGELOG table that allow us to track the changes. You can use this table information to do this.

answered Jun 19 '23 at 11:56

Hassan Javed

1
1

Hi Hassan, thanks for your answer. Can you expand it and share some resources for further reading, or show some example code to help clarify your answer? – Mohammad Athar Jun 20 '23 at 17:58
1

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Mohammad Athar Jun 20 '23 at 17:58

Track which files has been read and loaded into db

3 Answers3