
We know that the sed command loops over each line of a file and, for each line, loops over the given list of commands and does something. But when the file is extremely large, the time and resources spent on this repeated work can be terrible.

Suppose that I have an array of line numbers which I want to use as addresses for deleting or printing with the sed command (e.g. A=(20000 30000 50000 90000)), and there is a VERY LARGE file to work on.

The easiest way may be the following (remark by @John1024: be careful about the line numbers changing for each loop):

( for NL in ${A[@]}; do sed "$NL d" $very_large_file; done; )>.temp_file;
cp .temp_file $very_large_file; rm .temp_file

The problem with the code above is that, for each line number in the array, it has to loop over the whole file.

To avoid this, one can:

#COMM=`echo "${A[@]}" | sed 's/\s/d;/g;s/$/d/'`;
#sed -i "$COMM" $very_large_file;
#Edited: Better with direct parameter expansion:
sed -i "${A[*]/%/d;}" $very_large_file;

It first prints the array and replaces each SPACE and the END_OF_LINE with sed's d command, so that the string looks like "20000d;30000d;50000d;90000d"; on the second line, we treat this string as the command list of sed. The result is that with this code, it loops over the file only once.
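For example, with the array above the parameter expansion produces the whole sed script as a single word (a quick check, assuming bash and GNU sed, which accepts whitespace between ;-separated commands):

A=(20000 30000 50000 90000)
echo "${A[*]/%/d;}"    # prints: 20000d; 30000d; 50000d; 90000d;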

Moreover, for the in-place operation (option -i), one cannot quit early with q even once the greatest line number of interest has passed, because if so, the lines after that line (e.g. 90001 onward) would disappear (it seems that the in-place operation simply overwrites the file with stdout).

Better ideas?

(Reply to @user unknown:) I think it could be even more efficient if we managed to "quit" the loop once all indexed lines have passed. We can't with sed -i, for the aforementioned reasons. Printing each line to a file costs more time than copying a file (e.g. cat file1 > file2 versus cp file1 file2). We may benefit from this concept using other methods or tools. This is what I expect.
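For instance (a rough sketch, assuming GNU sed and the example array above): without -i, one could let sed quit just past the largest line of interest and let tail copy the untouched remainder:

max=$(printf '%s\n' "${A[@]}" | sort -n | tail -n 1)
# delete the target lines, then stop at line max+1 (q prints that line before quitting)
sed "$((max + 1))q;${A[*]/%/d;}" "$very_large_file" > .temp_file
# append everything from line max+2 onward, untouched
tail -n +"$((max + 2))" "$very_large_file" >> .temp_file
mv .temp_file "$very_large_file"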

PS: The points of this question are "line location" and "efficiency"; the "delete lines" operation is just an example. For real tasks there is much more - append/insert/substitute, field splitting, conditional tests followed by reading from/writing to files, calculations, etc. In other words, it may involve all kinds of operations, creating sub-shells or not, caring about variable passing, ... So the tool should let me do line processing, and the problem is how to get to the lines of interest and perform all kinds of operations there.

Any comments are appreciated.

LDecem
  • You are correct that `sed -i` does not really write-in-place. It first creates a new file and then replaces the old file with the new one. For details on this, see [this answer](http://stackoverflow.com/a/27075334/3030305). – John1024 Feb 27 '18 at 06:32
  • A key issue is that, even if you remove just one line in the file, the byte position of every line that follows must be changed. Is there any alternative to "deleting" lines? Could you comment one out (say, replace the first character on the line with `#`) without changing the length of the line and therefore avoiding the need to move all remaining lines? – John1024 Feb 27 '18 at 06:36
  • @John1024 yes I had forgotten this issue, thank you. For method 1, this issue exists, and your suggestion should work; or just simply not use the in-place argument, and instead quote the whole code and redirect the output to the file. For method 2, this issue does not exist. – LDecem Feb 27 '18 at 06:54
  • ‘sed file1 > file1’ has undefined behavior. I’m not sure the loop fixes it. Obviously your 2nd solution is better anyway. – zzxyz Feb 27 '18 at 07:25
  • Another problem with looping over the file is that you need to start with the highest line number. When you want to remove lines 1 and 5, and start with removing line 1, you need to remove line 4 in the next loop. – Walter A Feb 27 '18 at 07:38
  • @zzxyz maybe I should say it another way... sed without the silence option "-n" prints things to stdout (normally to the screen); but with ( .. ) > file, the stdout within the parentheses is redirected to the file. Ah, BTW, there were typing mistakes, it's edited now. – LDecem Feb 27 '18 at 08:00
  • So what do you expect? What should be a more efficient method, than looping once over the file? – user unknown Feb 27 '18 at 08:09
  • @LDecem: My system isn't the fastest one, 2 dual cores, 5 y/o SSD, 50 GB SSD; cp a file of 1.5 MB, 5000 lines of text, or sed the same file `sed '1d;10d;100d;1000d' $f > $f.sed.out`, below 1s, same time as cp. The same for a file with 100.000 lines but about 0.5M in size: cp 1s and sed 2s. Too small to make definitive claims. How big are your files, to have such concerns? When the result differs in size, you have to read the whole file, there is no way around it, ... – user unknown Feb 27 '18 at 14:43
  • ... except developing your own filesystem of linked lines, where you link from line 90.000 to line 90.001 and are finished, after linking from 89.999 to 90.001, no matter how many lines will follow. – user unknown Feb 27 '18 at 14:44
  • @userunknown Well in fact I have a lot of string operations, so they "multiply" the actual quantity of data... Several hundred MB for each file, and... thousands of files... Besides, I'm doing Monte Carlo things, so there are probability, statistics, etc., and now I think I should use another language though. – LDecem Feb 28 '18 at 06:53
  • @userunknown It's not even an SSD, just a RAID of hard disks, and it's not running at full speed - the processors are the limit – LDecem Feb 28 '18 at 06:59

3 Answers


First make a copy of the file to a testfile for checking the results. You want to sort the line numbers, highest first.

echo "${a[@]}" | sed 's/\s/\n/g' | sort -rn 

You can feed commands into ed using printf:

printf "%s\n" "command1" "command2" w q testfile | ed -s testfile

Combine these

printf "%s\n" $(echo "${a[@]}" | sed 's/\s/\n/g' | sort -rn | sed 's/$/d/') w q |
   ed -s testfile

Edit (tx @Ed_Morton):
This can be written in fewer steps with

printf "%s\n" $(printf '%sd\n' "${a[@]}" | sort -rn ) w q | ed -s testfile

I cannot remove the sort, because ed renumbers the remaining lines after each delete, so the deletions have to run from the highest line number down.
I tried to find a command for editing the file without redirecting to another one, but I started with the remark that you should make a copy anyway. I have no choice: I have to upvote the straightforward awk solution that doesn't need a sort.
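A tiny illustration of the ordering point (my own sketch; testfile is assumed to contain at least 8 lines):

printf "%s\n" 7d 3d w q | ed -s testfile   # highest first: deletes the original lines 7 and 3
printf "%s\n" 3d 7d w q | ed -s testfile   # lowest first: after 3d, "7d" removes what was originally line 8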

Walter A
  • Impressive use of ed. May I ask whether there is a reduction in computational cost? It seems that it still needs to read the whole file anyway – LDecem Feb 27 '18 at 09:04

sed is for doing s/old/new, that is all, and when you add a shell loop to the mix you've really gone off the rails (see https://unix.stackexchange.com/q/169716/133219). To delete lines whose numbers are stored in an array, you can do this (using seq to generate input since no sample input/output was provided in the question):

$ a=( 3 7 8 )
$ seq 10 |
    awk -v a="${a[*]}" 'BEGIN{split(a,tmp); for (i in tmp) nrs[tmp[i]]} !(NR in nrs)'
1
2
4
5
6
9
10

and if you wanted to stop processing with awk once the last target line has been deleted, and let tail finish the job, then you could figure out the max value in the array up front and then run awk on just the part up to that last target line:

max=$( printf '%s\n' "${a[@]}" | sort -rn | head -n 1 )
head -n "$max" file | awk '...' > out
tail -n +"$((max+1))" file >> out

idk if that'd really be any faster than just letting awk process the whole file since awk is very efficient, especially when you're not referencing any fields and so it doesn't do any field splitting, but you could give it a try.
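For reference, assembled for the example array above (my own combination of the two snippets; the awk script is the one shown earlier, and file/out are placeholder names):

a=( 3 7 8 )
max=$( printf '%s\n' "${a[@]}" | sort -rn | head -n 1 )
head -n "$max" file |
    awk -v a="${a[*]}" 'BEGIN{split(a,tmp); for (i in tmp) nrs[tmp[i]]} !(NR in nrs)' > out
tail -n +"$((max+1))" file >> out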

Ed Morton
  • I didn't downvote (in fact I'm about to upvote), but you might explain that the purpose of `seq` in your block is to provide test data to awk so that you can demonstrate the output. Probably *totally* obvious to you, but I missed it on my first read. – zzxyz Feb 27 '18 at 21:18
  • In fact I'm using sed to do quite an amount of insertion/append/substitution on specific lines (matched by line number or regex), though awk can do this in nearly the same way. I find that the procedures of sed and awk are quite similar, except that sed has the "hold space", which for me is quite powerful for reducing processing time - though it might be the same as assigning the pattern to an array, for example. With your suggestion I managed to recognize other places to improve. Thank you very much, sincerely. – LDecem Feb 28 '18 at 06:45
  • @LDecem you're welcome. From your comment, though, I don't think you really understand what awk is. sed's hold space and every other sed language construct except s, g, and p (with -n) became obsolete in the mid 1970s when awk was invented. sed is a great tool which I've been using almost daily for around 35 years and still use today, but the stuff people force it to do with convoluted strings of single-character runes just leaves me shaking my head when there's always a clearer, simpler, more efficient, more easily extensible, more robust, and more portable awk solution. – Ed Morton Feb 28 '18 at 13:57
  • @EdMorton Quite right, I have been forcing myself to use shell and bash at work, starting from zero, and it has not been longer than 3 months... I swallow everything up and there are many misunderstandings that make people laugh. Still a long way to go. – LDecem Feb 28 '18 at 14:56
  • Just be careful with sed questions/answers - a lot of people will post immensely complicated sed solutions so everyone is in awe of how brilliant they must be to be able to figure that out when in fact there's an absolutely trivial awk solution and the sed one is just for the mental exercise! Also many times the "sed solution" is actually a sed+grep+shell+tr+other mish-mash. If you need sed+sed or sed+grep or grep+grep or any other similar combination, just use awk instead. – Ed Morton Feb 28 '18 at 14:58

You could generate an intermediate sed command file from your lines.

printf '%s\n' "${A[@]}" | sort -n > lines_to_delete
min=$(head -n 1 lines_to_delete)
max=$(tail -n 1 lines_to_delete)
# drop the first and last line numbers (handled by head/tail below)
# and turn the remaining ones into sed delete commands
sed -i -e 1d -e '$d' -e 's#$#d#' lines_to_delete
# everything before the first target line
head -n $((min - 1)) input > output
# the middle: skip the prefix, delete the remaining targets
head -n $((max - 1)) input | sed -e "1,${min}d" -f lines_to_delete >> output
# everything after the last target line
tail -n +$((max + 1)) input >> output
mv output input
daniu