
"Way back in the day", mid 1990s, I had to write a C program I called "readline" to avoid the global vs local variable issue created by subshelling when one did a construct like:

while read line
do
   my_var=$(echo "$line" | cut -f 12 -d ":")
   if [ "$my_var" == "$target" ] ;
   then
      found_target=1
   fi
done < some_file

While updating this question to hopefully address some comments, I realized another issue I'm clueless about regarding this type of loop: how do you implement "we have found the target, we can quit reading now!" with this type of loop? I'd guess it'd be something involving:

while [ -z "$found_target" ] && 

But I don't know how to finish the line! To "work correctly", the example would have to leave found_target and my_var set in the loop for code following the loop to use.
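My best guess at a completion would be something like the following - a minimal, untested sketch that folds the read into the loop condition so the loop stops as soon as the flag is set:

found_target=""
while [ -z "$found_target" ] && read -r line
do
   my_var=$(echo "$line" | cut -f 12 -d ":")
   if [ "$my_var" == "$target" ]
   then
      found_target=1   # the next test of the loop condition ends the loop
   fi
done < some_file

# ...and code here would then use $found_target and $my_var,
# assuming they really do survive the loop

Or perhaps a plain break right after setting found_target would do the same job more simply?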

Note that the code example here is not one I'm working with today, for the simple reason that, having been burned by issues created by the input redirection (the < file construct) to a while loop, I don't do that any more! In the "backstory" bit below, you might see how this idea got started, but it could all be based on a misunderstanding of Bash.

In short, it was observed that variables set in the loop during processing of the read - such as found_target in this example - were lost when the loop exited. Someone who was supposed to be a Bash expert (back in the 1995-to-'97 era) told us - the team I was leading - that it was because the interior of the loop was put into a subshell. I was a database guy who'd done machine-language coding, etc., etc., and didn't even think of Bash as a programming language. So, given the problem as it was handed to me, I just handed back to the team a readline program that allowed them to move beyond their difficulties.

My simple program takes just one or two arguments: you tell it the integer line number you want, and you either pass a file via stdin or point it at a file via a filespec. It wasn't as inefficient as one might think, thanks to operating-system caching. And it let these even less skilled programmers get on with their work.

This program was very satisfactory, especially for large files - the larger the file, the bigger the win, since Bash isn't (or at least wasn't) particularly efficient at such use, to say nothing of the subshell / global-variable issue. (Please see the section below for the use case, on WHY this would make sense!)

Now, however, I'd like to revisit this issue for two reasons: 1) Bash and its attendant utilities have come a long way in the intervening two-plus decades, and 2) I'd like to provide a bit of software to someone without the dependency on my readline program - and there, the subshell issue is the real problem, that and the fact that the people who will be working with it are, like the original people I wrote readline for, not really programmers. However, if there's an open-source version of my readline, that'd work just fine!
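To illustrate the sort of thing I'm after, I'd guess a little wrapper around sed could approximate readline's interface - just a sketch, with the function name and argument handling invented here for illustration:

# rough stand-in for my old readline program
# usage: readline LINE_NUMBER [FILE]  - reads stdin if FILE is omitted
readline() {
   local n=$1
   if [ $# -ge 2 ]
   then
      sed -n "${n}p; ${n}q" "$2"   # print line n of the file, then quit
   else
      sed -n "${n}p; ${n}q"        # same, but reading stdin
   fi
}

readline 12 some_file     # by filespec
readline 12 < some_file   # via stdin

Whether that would be efficient enough for really large files is another question.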

In addition to those motives: while I've come a long way in understanding Bash since then, it's still mostly a tertiary issue for me, and I know I'm still profoundly ignorant of large chunks of it. One thing I'm thinking could perhaps be "the right way" would be a more intelligent use of functions. Back then, I was ignorant of the ability to redirect into and out of a Bash function. And, frankly, while I now know "it's a thing," I've never actually used it yet.
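If I understand it correctly, using it here would look something like this (again untested, and find_target is just a name I've made up for this example):

# the function body runs in the current shell (no subshell),
# so the variables it sets should persist after it returns
find_target() {
   local line   # only "line" is scoped to the function
   while read -r line
   do
      my_var=$(echo "$line" | cut -f 12 -d ":")
      if [ "$my_var" == "$target" ]
      then
         found_target=1
         break   # quit reading once the target is found
      fi
   done
}

find_target < some_file          # the redirection into the function
echo "$found_target" "$my_var"   # both still set here?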

Some Backstory

This is definitely more the kind of thing for the "Retro-Computing" community: back in 1995 or 1996, when this "don't do that!" idea came about, Bash was used as part of a "glue layer" trying to join around seven systems designed by different teams for different aspects of Earth science. None of these systems were all that well designed, either, being done by Earth scientists whose passion was Earth, not computers. For most, their idea of a database was crude - typically huge lines of text in a big file - and all they wanted was to pick through a few possibly adjacent lines in the middle of what were, at the time, considered gigantic files. And to join, say, the atmospheric data with the ocean-surface data, the best that was practical was for some grad student or post-doc to write Bash code that took little bits of other code and brought it together.

For what it's worth, my goal was to get 'em to use relational database engines - and, indeed, modern PostgreSQL came from the same lab at the same time. However, the best I managed was to use the database as the meta-layer: knowing what data was in which systems, how to get to those systems, and what programs to call to actually do the scientific part of the data joins. I hope this digression gives some perspective on the why!

Hey, if the whole issue of subshells is just plain wrong, please school me! I can be taught! Otherwise, a suggestion for replacing my readline would be nice.

  • Add to your loop `x="$line"` and after your loop `echo "$x"`? – Cyrus Aug 05 '23 at 23:56
  • I'm not sure I understood the goal here, but `awk 'NR==123' file.txt` or `head -n 123 file.txt | tail -n 1` or `sed -n '123p' file.txt` would grab a particular line from a file. – Mikael Öhman Aug 06 '23 at 00:09
  • I'm very confused about both the problem and why `readline` was a solution to it. The `while read ... done < some_file` loop you give in the question will *not* run anything in a subshell, so it won't have any of the problems associated with them. If you used `somecommand | while read ...`, then the pipe would make the loop run in a subshell, but this doesn't happen for plain redirections (see [BashFAQ #24](http://mywiki.wooledge.org/BashFAQ/024) for more info and workarounds). – Gordon Davisson Aug 06 '23 at 00:30
  • Also, if you're actually working through the file line-by-line, the `while read` loop will be faster than using an external program, *especially* for large files. This is because to get line 1000, the program has to read & discard the first 999 lines; then, to get line 1001, it has to read those 999 lines *and* #1000, and discard them *again*; etc. This gives quadratic performance: doubling the size of the file quadruples the time it takes to process all of the lines. And that's on top of the fact that running an external command involves creating a process, which is slow by itself. – Gordon Davisson Aug 06 '23 at 00:34
  • Yes, bash's `read` is slow (insofar as it calls the `read` syscall one character at a time) -- but running a new copy of `sed` or `awk` or any other tool for each line you want to read is even slower. Please ask a concrete practical question so we can dig into the performance of a specific real-world scenario rather than something so high-level that any answer is speculative in nature. – Charles Duffy Aug 06 '23 at 01:00
  • And the `while read` loop you show **doesn't** create any syscalls at all. Maybe you're thinking of `cat some_file | while read line`, but that's why `cat some_file |` is bad form. – Charles Duffy Aug 06 '23 at 01:01
  • @CharlesDuffy, bash's `read` reads input one character at a time only when reading from non-seekable sources such as pipes and sockets. When reading directly from seekable sources it reads input in blocks and uses "seek" to set the file input position just after the end of the first complete line of input. – pjh Aug 06 '23 at 12:40
  • (ugh, "doesn't create any syscalls" should of course have been "doesn't create any subshells") – Charles Duffy Aug 06 '23 at 13:06
  • Thanks for all the great input - I'll clarify and address these things as soon as I can. – Richard T Aug 07 '23 at 14:55
  • @CharlesDuffy Just tried to address the comments but only noticed yall'd closed it presuming it's about efficiency?! It's not, it's about the variables. Hope you'll re-read and re-consider! – Richard T Aug 07 '23 at 16:27
  • The edit changes the problem quite thoroughly! `my_var=$(echo "$line" | cut -f 12 -d ":")` is where you have the efficiency problem. – Charles Duffy Aug 07 '23 at 16:29
  • Think about `while IFS=: read -r -a vars; do my_var=${vars[11]}; ...; done < some_file` – Charles Duffy Aug 07 '23 at 16:30
  • Or, maybe, `while read -r my_var; do ...; done < <(cut -f12 -d: some_file)` – Charles Duffy Aug 07 '23 at 16:30
  • But still, the only subshell in your code is the one created by `$( )`, so variables set _inside_ `$()` are lost, but they're the only ones; nothing in your code otherwise creates a limited variable scope. If you're seeing contrary behavior, [edit] to provide a [mre] that shows that behavior when run without changes. – Charles Duffy Aug 07 '23 at 16:32
  • BTW, if you want to exit a loop early, that's what `break` is for, so you can put `break` right after `found_target=1`. – Charles Duffy Aug 07 '23 at 16:33
  • As another aside, this one portability-related, `=` is better form than `==` inside `[` -- see the POSIX specification at https://pubs.opengroup.org/onlinepubs/9699919799/utilities/test.html; only `=` is standardized, `==` is a nonstandard extension, and some versions of sh outright disallow it. – Charles Duffy Aug 07 '23 at 16:35
  • Again, though: `found_target` **is** left in place when your loop exits, and the only reason `my_var` isn't left in place is because you aren't using `break`, so later lines in the file overwrite it with values from those future lines. **If you're seeing anything else, we need a [mre] we can run without changes to see that problem too**; consider testing your examples in a sandbox like https://repl.it/ to make sure they reproduce the problem on their own. – Charles Duffy Aug 07 '23 at 16:37
  • (No, I didn't misread your question as being about efficiency, but because efficient patterns don't create subshells, for the most part fixing efficiency problems _also_ fixes scoping problems; the exceptions are things like [BashFAQ #24](https://mywiki.wooledge.org/BashFAQ/024), but the code you've shown us thus far isn't subject to the problem that FAQ discusses. We don't need your personal history and career trajectory; we _do_ need a terse reproducer for a narrow, specific technical problem that lets someone else see that problem on their own host and test solutions). – Charles Duffy Aug 07 '23 at 16:39
  • See https://ideone.com/50WBZM -- your original code, successfully finding the target, with the variables all still persisting after the loop exited. We need an example that's equally clear that lets us see your code that _doesn't_ work. – Charles Duffy Aug 07 '23 at 16:45
  • BTW, notice the `while read -r my_var; do ...; done < <(cut -f12 -d: some_file)` pattern. – Charles Duffy Aug 07 '23 at 16:52
  • @CharlesDuffy WOW, Charles, THANKS BUNCHES! I got some coffee, returned, and you've just hit me with a dozen great comments already! ;-) ... The MOST important one, I think: "nothing in your code otherwise creates a limited variable scope. If you're seeing contrary behavior" HOLY COW?! ...I feel VERY sure that WAY back then, the code presented to me had that issue, but it's "lost to history" now. I didn't know enough then and I never revisited it! THIS makes me VERY happy - AND, your other efficiency comments are great, too, but more of a case-by-case basis - should I delete the question? – Richard T Aug 07 '23 at 17:02
  • @CharlesDuffy Just got done reading all your comments and going through them; VERY helpful, THANKS. I learned in the last 5 years or less that BASH is far more powerful and useful than I'd thought before and think that many computer folks are put off of thinking of it properly due to either not considering it a programming language or discouraged by code that's unreadable / unintelligible by non-experts. Your explanations are superb. I'm embarrassed I didn't know better and never revisited the issue, but glad I asked! – Richard T Aug 07 '23 at 17:26
