
Let's consider two text files, 'main_list' and 'ignore_list'. For each line in ignore_list, I want to remove the lines in main_list that start with that string.

Basically, this is something doable with sed and a while loop.

E.g.

while read line; do echo ^$line; sed -i "/^$line/d" ./main_list; done < ./ignore_list

As an improvement, I wanted to build the sed pattern first and then run sed only once:

while read line; do
    if [ $SED_PATTERN="" ]; then 
      SED_PATTERN="^$line"
    else
      SED_PATTERN=$SED_PATTERN"\|^$line"
    fi
  done < ./ ignore_list
echo $SED_PATTERN
sed -i "/$SED_PATTERN/d" ./main_list

Unfortunately, because of the subshell used by the while loop, it does not work.

"A variable modified inside a while loop is not remembered" and https://mywiki.wooledge.org/BashFAQ/024 both give useful explanations and workarounds, but I haven't yet managed to get one working in a simple way.

Ideally, I want to use the sh shell (the script will run in a GitLab pipeline with a simple Alpine image).

Any ideas for keeping it simple before I move to a Python script (and use a fat image instead of Alpine; as a middle ground, I can also use an image with bash)?

Or maybe an approach other than sed and a while loop?

Thanks.

Edit: some more context about the content of both files. I am dealing with a list of Debian packages installed by a build step. The main_list is the output of a dpkg-query command (see below), so it should not contain any unusual characters. The ignore_list contains the packages I want to ignore in a later post-processing step: internal components that are not relevant for my output.

Here is a small extract of both files.

main_list

e2fsprogs|1.46.2-2|e2fsprogs|1.46.2-2
ebtables|2.0.11-4|ebtables|2.0.11-4
edgeonboarding-config|0.1|edgeonboarding-config|0.1
efibootguard|0.13+cip|efibootguard|0.13+cip
ethtool|1:5.9-1|ethtool|1:5.9-1

ignore_list

edgeonboarding-config

You can generate the main_list on any Debian-based Linux system by running

dpkg-query -f '${source:Package}|${source:Version}|${binary:Package}|${Version}\n' -W > main_list

and for the ignore_list, just pick a few strings from the main_list (the beginning of some lines).

EDIT2: anyway, my initial idea with a while loop is not necessary. I just need to:

  • run one sed command over ignore_list that replaces each line $myline and its trailing newline with ^$myline|
  • store the output as SED_PATTERN
  • run another sed command: sed -i "/$SED_PATTERN/d" ./main_list
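
The steps above can be sketched roughly as follows. This is only a minimal sketch under some assumptions: the sample files are recreated in a temporary directory, 'ethtool' is added to ignore_list purely for illustration, the entries contain no regex metacharacters, '/' or '|', and the installed sed accepts the GNU-style \| alternation and the -i option.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > main_list <<'EOF'
e2fsprogs|1.46.2-2|e2fsprogs|1.46.2-2
ebtables|2.0.11-4|ebtables|2.0.11-4
edgeonboarding-config|0.1|edgeonboarding-config|0.1
efibootguard|0.13+cip|efibootguard|0.13+cip
ethtool|1:5.9-1|ethtool|1:5.9-1
EOF
# 'ethtool' is a hypothetical second entry, added for illustration.
printf '%s\n' 'edgeonboarding-config' 'ethtool' > ignore_list

# Step 1: prefix each entry with ^, join the lines with |,
# then turn every | into \| (alternation in GNU-style basic regexes).
SED_PATTERN=$(sed 's/^/^/' ignore_list | paste -s -d '|' - | sed 's/|/\\|/g')
printf '%s\n' "$SED_PATTERN"

# Step 2: a single sed pass deletes every matching line.
sed -i "/$SED_PATTERN/d" main_list
cat main_list
```

No loop is involved, so the question of whether a variable survives the loop does not arise here.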
EricBDev
  • Check your script with shellcheck. – KamilCuk Aug 10 '23 at 21:26
  • `[ $SED_PATTERN="" ]` is always `true` ... – Jetchisel Aug 10 '23 at 21:27
  • please update the question with a few lines from both files (include a mix of matching and non-matching lines) and the expected output – markp-fuso Aug 10 '23 at 21:27
  • your `while` loop is not executed in a subshell – markp-fuso Aug 10 '23 at 21:29
  • Also, you haven't looked at the [intersect page](https://mywiki.wooledge.org/BashFAQ/036) – Jetchisel Aug 10 '23 at 21:33
  • Darn, we should have waited to see your input before posting answers. From what you show now, you shouldn't be trying to match lines that start with the strings from ignore_list, you should be matching lines where the first `|`-separated field is present in ignore_list - that's a quite different requirement needing a different but potentially simpler answer. – Ed Morton Aug 11 '23 at 09:20
  • I had some wildcards like 'bla-*' in my initial ignore_list, hence my initial requirement. But I changed them to full strings to match your simplification; it does indeed make more sense. – EricBDev Aug 11 '23 at 11:37
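
Following the comments above, here is a minimal sketch of how the loop from the question might look with the test corrected. It assumes a POSIX-style sh such as bash, dash or BusyBox ash; the ignore_list entries are hypothetical and are assumed to contain no regex metacharacters. Because the loop reads its input through a redirection (done < file) rather than from a pipe, it runs in the current shell, so SED_PATTERN is still set after the loop ends.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Hypothetical ignore_list entries, for illustration only.
printf '%s\n' 'ebtables' 'ethtool' > ignore_list

SED_PATTERN=""
while IFS= read -r line; do
    # Skip empty lines: a bare ^ would match every line of main_list.
    [ -n "$line" ] || continue
    # -z tests for an empty string; [ $SED_PATTERN="" ] is a single
    # non-empty word and is therefore always true.
    if [ -z "$SED_PATTERN" ]; then
        SED_PATTERN="^$line"
    else
        SED_PATTERN="$SED_PATTERN\|^$line"
    fi
done < ./ignore_list

# The variable is still visible here, outside the loop.
printf '%s\n' "$SED_PATTERN"
```

The resulting pattern can then be passed to sed exactly as in the question: sed -i "/$SED_PATTERN/d" ./main_list.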

3 Answers


You can do this with the grep -v command. Use the -f option to read the list of patterns to filter out from a file. Use process substitution to put ^ at the beginning of every line in ignore_list and use that as the pattern file.

grep -v -f <(sed 's/^/^/' ignore_list) main_list > main_list.new && mv main_list.new main_list
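
Process substitution (the <(...) syntax) is not specified by POSIX sh, so for a plain sh environment the same idea could be sketched with an intermediate pattern file. This is a rough sketch, not a tested drop-in: the file name ignore_patterns and the extra 'ethtool' entry are made up for illustration, and the tr and sed steps are defensive additions that strip carriage returns and skip empty lines (an empty line would become a bare ^ pattern and match everything).

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > main_list <<'EOF'
e2fsprogs|1.46.2-2|e2fsprogs|1.46.2-2
ebtables|2.0.11-4|ebtables|2.0.11-4
edgeonboarding-config|0.1|edgeonboarding-config|0.1
efibootguard|0.13+cip|efibootguard|0.13+cip
ethtool|1:5.9-1|ethtool|1:5.9-1
EOF
# Sample ignore_list with a CRLF line ending and an empty line on purpose.
printf 'edgeonboarding-config\r\n\nethtool\n' > ignore_list

# Build the pattern file: drop CRs, drop empty lines, anchor each entry with ^.
tr -d '\r' < ignore_list | sed -e '/^$/d' -e 's/^/^/' > ignore_patterns

# Keep only the lines that match none of the patterns.
grep -v -f ignore_patterns main_list > main_list.new && mv main_list.new main_list
cat main_list
```

The entries are still treated as regular expressions here, so a '.' in a package name matches any character.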
Barmar
  • We don't know what characters can occur inside `ignore_list`. If they include regex metacharacters, you need to escape them in your `sed`, and this is the point where I would indeed switch to e.g. Perl, Ruby or Python. – user1934428 Aug 11 '23 at 06:15
  • I added more context about my files. The grep -v command seems like the right path. However, the proposed solution is not working yet: it only removes the last line of the ignore_list – EricBDev Aug 11 '23 at 06:57
  • Any chance the `ignore_list` file has CRLF newlines? Fix that with `dos2unix` – Barmar Aug 11 '23 at 14:47

Using any POSIX awk given the input/output you've recently added to your question:

awk -F'|' '
    NR==FNR {
        sub(/[[:space:]]+$/,"")
        ign[$0]
        next
    }
    !($1 in ign)
' ignore_list main_list

That is doing a literal full string comparison against just the first |-separated field of each line.
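
To illustrate what the full-field comparison means in practice, here is a small sketch that runs the same awk program on the sample main_list from the question. The two ignore_list entries are hypothetical: 'ethtool' is a complete package name, while 'e2fs' is only a prefix of one.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > main_list <<'EOF'
e2fsprogs|1.46.2-2|e2fsprogs|1.46.2-2
ebtables|2.0.11-4|ebtables|2.0.11-4
edgeonboarding-config|0.1|edgeonboarding-config|0.1
efibootguard|0.13+cip|efibootguard|0.13+cip
ethtool|1:5.9-1|ethtool|1:5.9-1
EOF
# Hypothetical entries: one full package name and one mere prefix.
printf '%s\n' 'ethtool' 'e2fs' > ignore_list

result=$(awk -F'|' '
    NR==FNR {                      # first file: collect the ignore entries
        sub(/[[:space:]]+$/,"")
        ign[$0]
        next
    }
    !($1 in ign)                   # second file: print unless field 1 is ignored
' ignore_list main_list)
printf '%s\n' "$result"
```

The ethtool line is dropped, but the e2fsprogs line is kept, because 'e2fs' does not equal the whole first field.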

If you were to use sed and/or grep for this then you'd need to escape all possible regexp metachars in ignore_list first, see is-it-possible-to-escape-regex-metacharacters-reliably-with-sed.


Original answer before you showed us sample input/output:

Using any POSIX awk (untested due to no sample input/output provided):

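awk '
    NR==FNR {
        sub(/[[:space:]]+$/,"")
        ign[$0]
        next
    }
    {
        # skip the line if it starts with any ignored string
        for ( str in ign ) {
            if ( index($0,str) == 1 ) {
                next
            }
        }
        # no ignored string matched, so keep the line
        print
    }
' ignore_list main_list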

That is doing a literal substring comparison against just the start of each line.

If you were to use sed and/or grep for this then you'd need to escape all possible regexp metachars in ignore_list first, see is-it-possible-to-escape-regex-metacharacters-reliably-with-sed.

Ed Morton
  • thanks, another welcome input. However, it is not working yet: for some reason, only 2 lines of my main_list get removed, while my ignore_list currently contains 21 lines. – EricBDev Aug 11 '23 at 07:08
  • Maybe you have spaces at the end of lines in ignore_list. In particular, it might have Carriage Returns, see https://stackoverflow.com/questions/45772525/why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it. I updated my answer to strip spaces from the end of lines in ignore_list just in case they're present. – Ed Morton Aug 11 '23 at 09:12
  • thanks a lot, your last script with awk is working well. I just changed the 'ignore_list main_list' to '$1 $2 > $3' so that it is more flexible to use. – EricBDev Aug 11 '23 at 11:34

This might work for you (GNU sed):

sed 's#.*#/^&|/d#' ignore_list | sed -f - main_list

Create a sed program from the ignore_list and apply it against main_list.

N.B. If there are likely to be metacharacters in the ignore_list these will need to be escaped.
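
To make the generated program visible, here is a small sketch that writes it to a file before applying it. The file name ignore.sed and the extra 'ethtool' entry are assumptions for illustration; using a file also avoids relying on sed -f - reading the script from standard input.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > main_list <<'EOF'
e2fsprogs|1.46.2-2|e2fsprogs|1.46.2-2
ebtables|2.0.11-4|ebtables|2.0.11-4
edgeonboarding-config|0.1|edgeonboarding-config|0.1
efibootguard|0.13+cip|efibootguard|0.13+cip
ethtool|1:5.9-1|ethtool|1:5.9-1
EOF
# 'ethtool' is a hypothetical second entry, added for illustration.
printf '%s\n' 'edgeonboarding-config' 'ethtool' > ignore_list

# Turn each entry X into the sed command /^X|/d and save the program.
sed 's#.*#/^&|/d#' ignore_list > ignore.sed
cat ignore.sed
# prints:
#   /^edgeonboarding-config|/d
#   /^ethtool|/d

# Apply the generated program to main_list.
result=$(sed -f ignore.sed main_list)
printf '%s\n' "$result"
```

Because each generated pattern ends with a literal |, it only matches when the whole first field equals the entry.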

potong