
I am currently working on a project to automate a manual task in my office. We have a process where we have to re-trigger some of our IDs when they fall into repair. As part of the process, we have to extract those IDs from an Oracle DB table, put them in a file on our Linux server, and run a command like this:

Example file:

$ cat /task/abc_YYYYMMDD_1.txt
23456
45678

...and so on

cat abc_YYYYMMDD_1.txt | scripttoprocess -args

I am using an existing Java-based tool called 'scripttoprocess'. I can't see what's inside it, as it seems to be encrypted. I simply go to the location where my files are present and then use it like this:

cd /export/incoming/task
for i in `ls abc_YYYYMMDD*.txt`; do
    cat $i | scripttoprocess -args
    if [ $? -eq 0 ]; then
        mv $i /export/incoming/HIST/
    fi
done

scripttoprocess is an existing script; I am just calling it from my own script, which runs continuously in a loop in the background. It simply searches for the abc_YYYYMMDD_1.txt file in the /task directory, and if it detects such a file, it starts processing it. But I have noticed that my script starts processing a file well before it is fully written, and sometimes moves the file to HIST without fully processing it.

How can I handle this situation? I want to be fully sure that a file is completely written before I start processing it. Secondly, is there any way to take control of the files, like preparing a control file that contains a list of the files present in the /task directory, so that I can cat this control file and pick up the file names from it? Your guidance will be much appreciated.

Subodh
  • The process that writes the file should write to a different folder, or give the file a name that indicates it is incomplete. When it completes, it should "mv" the file into the folder where it gets picked up, or rename it, ensuring the two folders (source and dest) are on the same filesystem; see the sketch after these comments. – TenG May 24 '18 at 19:39
  • Hey TenG, thanks for your quick comment; this is a solution we have also thought about. Any idea how we can figure out whether a file is already being used by another process? – Subodh May 24 '18 at 19:57
  • @Subodh, renames are atomic, so if you rename a file to its final name when you're done writing it and finished flushing, there's no possibility for it to be partial/incomplete. – Charles Duffy May 24 '18 at 20:29
  • @Subodh, yes, it's possible to check if a file is in-use, but those checks are **not** atomic, so they're actually much more likely to be troublesome than depending on the guaranteed-atomic `rename()` semantics. – Charles Duffy May 24 '18 at 20:31
  • BTW, `for i in $(ls ...)` is extremely buggy; see [BashPitfalls #1](http://mywiki.wooledge.org/BashPitfalls#for_i_in_.24.28ls_.2A.mp3.29) and [ParsingLs](http://mywiki.wooledge.org/ParsingLs). – Charles Duffy May 24 '18 at 20:33
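A minimal sketch of the write-then-rename pattern described in these comments (the extraction helper name is hypothetical; the paths are taken from the question):

#!/bin/bash
# Writer side: create the file under a temporary name that the consumer's
# glob (abc_YYYYMMDD*.txt) does not match, then mv it into place. Within
# one filesystem, mv is a rename(), which is atomic, so the consumer can
# never observe a partially written .txt file.
out=/export/incoming/task/abc_YYYYMMDD_1.txt
tmp="${out}.tmp"

extract_ids_from_oracle > "$tmp"    # hypothetical: however the IDs are pulled from Oracle
mv -- "$tmp" "$out"                 # atomic rename: the file appears complete or not at all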

3 Answers


I used

iwatch -e close_write -c "/usr/bin/pdflatex -interaction batchmode %f" document.tex

to run a command (LaTeX-to-PDF conversion) when a file (document.tex) is closed after writing to it. You could do the same.

However, there is a caveat: this was only meant to catch manual edits to the file, and failure was not critical. It therefore ignores the case where the file is reopened and written to again immediately after closing. Ask yourself whether that is good enough for you.
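Applied to the question's directory, the same close_write idea might look like this sketch, using inotifywait from inotify-tools (an assumption: inotify-tools is installed; iwatch would work similarly):

#!/bin/bash
# Sketch: process each abc_*.txt file only after its writer closes it.
inotifywait -m -e close_write --format '%f' /export/incoming/task |
while IFS= read -r name; do
    case $name in
        abc_*.txt)
            if scripttoprocess -args < "/export/incoming/task/$name"; then
                mv -- "/export/incoming/task/$name" /export/incoming/HIST/
            fi
            ;;
    esac
done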

Ulrich Eckhardt
  • @Ulrich Eckhardt, sorry, but I can't ignore this, because sometimes the abc_YYYYMMDD*.txt files contain hundreds of records, and it becomes difficult to figure out whether the file was fully written, how much of it was processed, and what was left. – Subodh May 24 '18 at 19:54
  • @Subodh, `close_write` happens only when the file is *closed* after a write. If it's closed, it's necessarily done being written. The OP's caveat is only pertinent if files can be *reopened* and modified after they were already completely and fully written. – Charles Duffy May 24 '18 at 20:30
  • @Subodh, ...if you *do* need to handle the possibility of files being reopened and modified, consider using `flock()`-based advisory locking to mark them as in-use; only one process can have such a lock on a file at a time. – Charles Duffy May 24 '18 at 20:32
  • @Subodh, ...if files *can* be reopened, you can't safely use battlmonstr's answer either, because you could do the `lsof` check, then open the file, and then have it reopened and modified by the other/background process. Which is to say that if that's really the case that things can be reopened after they're closed, you have no safe option *except* modifying the other process(es) that can be accessing content to use advisory locking... but if it's *not* really the case, the answer given by Ulrich here is the standard/best-practice approach. – Charles Duffy May 24 '18 at 20:47
  • BTW, I would personally suggest [incron](https://inotify.aiken.cz/?section=incron&page=why&lang=en) as a way to run a program as soon as any file in a directory is created/changed/etc. – Charles Duffy May 24 '18 at 20:50
  • Thanks Charles, looks like I have tons of ideas here :). If close_write happens only when the file is closed after a write, then I will try to implement it, especially because I do not want to change the other process which generates these files. Thanks once again to @Ulrich and TenG – Subodh May 25 '18 at 14:26
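A minimal sketch of the flock(1)-based advisory locking Charles Duffy mentions above (both sides must cooperate, and the extraction helper is hypothetical; note there is still a brief window between the writer creating the file and taking the lock, which is one reason the rename pattern is usually preferred):

#!/bin/bash
# Writer side: hold an exclusive lock on the file for as long as it is
# being written.
(
    flock -x 9
    extract_ids_from_oracle >&9    # hypothetical extraction step
) 9> /export/incoming/task/abc_YYYYMMDD_1.txt

# Consumer side: skip the file if the writer still holds the lock.
f=/export/incoming/task/abc_YYYYMMDD_1.txt
(
    flock -n -x 9 || exit 1        # lock busy: file is still being written
    scripttoprocess -args < "$f" && mv -- "$f" /export/incoming/HIST/
) 9< "$f"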

I agree with @TenG: normally you shouldn't move a file until it is fully written. If you know for sure that the file is finished (like a file from yesterday), then you can move it safely; otherwise you can process it, but not move it. You can, for example, process part of it and remember the number of processed rows so that you don't restart from scratch next time.

If you really, really want to work with files that are "in progress", sometimes tail -F works for this case, but then your bash script is an ongoing process as well, not a job, and you have to manage it.
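For instance, a rough sketch (this assumes scripttoprocess can be invoked once per ID, which the question does not confirm):

#!/bin/bash
# Sketch: follow a file that is still being written and hand each new
# line (ID) to the processor as it arrives. This never terminates on its
# own; it is a long-running service rather than a batch job.
tail -F /export/incoming/task/abc_YYYYMMDD_1.txt |
while IFS= read -r id; do
    printf '%s\n' "$id" | scripttoprocess -args
done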

You can also check whether a file is currently open (and thus unfinished) using lsof (see https://superuser.com/questions/97844/how-can-i-determine-what-process-has-a-file-open-in-linux), for example:
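Here is a sketch of that check wrapped around the loop from the question (as noted in the comments on the previous answer, this test is not atomic, so it can still race with a writer):

#!/bin/bash
cd /export/incoming/task || exit 1
for f in abc_YYYYMMDD*.txt; do
    [ -e "$f" ] || continue              # glob matched nothing
    # lsof exits 0 when some process still has the file open; skip it then.
    if lsof "$f" > /dev/null 2>&1; then
        continue
    fi
    if scripttoprocess -args < "$f"; then
        mv -- "$f" /export/incoming/HIST/
    fi
done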

battlmonstr

Change the process that extracts the IDs from the Oracle DB table. You can use the mv approach commented on by @TenG, or put something special at the end of the file that shows the work is done:

#!/bin/bash
source file_that_runs_sqlcommands_with_credentials
output=$(your_sql_function "select * from repairjobs")
# Something more for removing them from the table and check the number of deleted records
printf "%s\nFinished\n" "${output}" >> /task/abc_YYYYMMDD_1.txt

or

#!/bin/bash
source file_that_runs_sqlcommands_with_credentials
output=$(your_sql_function "select * from repairjobs union select 'EOF' from dual")
# Something more for removing them from the table and check the number of deleted records
printf "%s\n" "${output}" >> /task/abc_YYYYMMDD_1.txt
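On the consuming side, the loop from the question could then verify the marker before touching the file. A sketch for the first variant's Finished line (head -n -1 is GNU coreutils, which matches the Linux server in the question):

#!/bin/bash
cd /export/incoming/task || exit 1
for f in abc_YYYYMMDD*.txt; do
    [ -e "$f" ] || continue
    # Process the file only once the writer's end-marker line is present.
    if [ "$(tail -n 1 "$f")" = "Finished" ]; then
        # Strip the marker line before feeding the IDs to the processor.
        if head -n -1 "$f" | scripttoprocess -args; then
            mv -- "$f" /export/incoming/HIST/
        fi
    fi
done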

Walter A