Removing rows in a dataset matching a value from a separate dataset

Question

I am having trouble matching strings from one file against another.

Say I have the following table:

broken
vector
unidentified
synthetic
artificial

And I have a second dataset that looks like this:

org1    Fish
org2    Amphibian
org3    vector
org4    synthetic species
org5    Mammal

I want to remove every row from the second table that matches a string from the first table, so that the output looks like this:

org1    Fish
org2    Amphibian
org5    Mammal

I was thinking of using grep -v in bash, but I am not quite sure how to make it loop through all the strings in table 1.

I am trying to work it out in Perl, but for some reason it returns all my values instead of just the ones that match. Any idea why?

My script looks like this:

#!/bin/perl -w

($br_str, $dataset) = @ARGV;
open($fh, "<", $br_str) || die "Could not open file $br_str/n $!";

while (<$fh>) {
    $str = $_;
    push @strings, $str;
    next;
}

open($fh2, "<", $dataset) || die "Could not open file $dataset $!/n";

while (<$fh2>) {
    chomp;
    @tmp = split /\t/, $_;
    $groups = $tmp[1];
    foreach $str(@strings){
        if ($str ne $groups){
            @working_lines = @tmp;
            next;
        }
    }
    print "@working_lines\n";
}

Hello, I added chomp to my script and I get the same result. It seems to read the first set fine, so I am not sure what the issue is. – Aletia, May 10 '19 at 18:49
See [this post](https://stackoverflow.com/a/55734041/4653379) for another approach to a similar problem. – zdim, May 10 '19 at 18:55

score 2 · Accepted Answer · answered May 10 '19 at 18:54

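chomp your input and use a hash for your first table. Without chomp, every string read from the first file keeps its trailing newline, so it never equals the chomped value from the second file. In addition, the print after the inner loop runs for every row, whether or not a match was found. A hash lookup avoids the inner loop entirely: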

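use warnings;
use strict;

my ( $br_str, $dataset ) = @ARGV;

# Load the strings to exclude (first table) as hash keys for fast lookup.
open( my $fh, "<", $br_str ) || die "Could not open file $br_str: $!\n";

my %strings;
while (<$fh>) {
    chomp;
    $strings{$_}++;
}
close $fh;

# Print each row of the second table unless the first word of its
# second column is one of the excluded strings.
open( my $fh2, "<", $dataset ) || die "Could not open file $dataset: $!\n";
while (<$fh2>) {
    chomp;
    my @tmp    = split /\s+/, $_;
    my $groups = $tmp[1];
    print "$_\n" unless exists $strings{$groups};
}
close $fh2;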

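Note that I used \s+ instead of \t, just to make my copy/paste easier. A side effect is that $tmp[1] holds only the first word of the second column, so the row "org4    synthetic species" is removed because "synthetic" is in the first table.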
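
For comparison, here is a minimal sketch of how the nested loop from the question could be repaired without switching to a hash. It is not part of the original answer: `keep_row` is a hypothetical helper, the sample data is inlined rather than read from files, and the columns are assumed to be tab-separated as in the question's script.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the question's nested loop, with the
# decision made only after every string has been compared.
sub keep_row {
    my ( $group, @strings ) = @_;
    for my $str (@strings) {
        return 0 if $str eq $group;    # exact match: drop the row
    }
    return 1;                          # no string matched: keep the row
}

# Lines as they would come from the first file, newline included.
my @strings = ( "broken\n", "vector\n", "synthetic\n" );
chomp @strings;    # without this, "vector\n" never equals "vector"

for my $line ( "org1\tFish", "org3\tvector", "org5\tMammal" ) {
    my ( undef, $group ) = split /\t/, $line;
    print "$line\n" if keep_row( $group, @strings );
}
```

Because this compares the whole second column with `eq`, a row such as "synthetic species" would not be dropped here; the hash version above drops it only because it splits on \s+ and looks at the first word.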

answered May 10 '19 at 18:54

toolic

But are we sure it's words? What if they need to exclude `synthetic-species` for `synthetic`? I think regex is safer than a lookup. If it is all words then this is very cool :) – zdim May 10 '19 at 19:00
Hi zdim, for my case I wanted it exactly like that, so that any time it finds synthetic, it removes the whole row, regardless of what the rest of the string says. So the script works like a charm :) – Aletia May 10 '19 at 19:03
@Aletia Great then :) That should be mentioned in the question or at least in the answer, that all text can only contain key-words as words – zdim May 12 '19 at 06:19
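
The regex alternative raised in the comments above could look roughly like the sketch below. It is an illustration, not code from the thread: `build_exclude_regex` and `filter_rows` are hypothetical names, the rows are inlined and assumed to be tab-separated, the keywords are assumed to be plain non-empty words, and the `org6` row is made up to show the `synthetic-species` case.

```perl
use strict;
use warnings;

# Hypothetical helper: build one regex that matches any keyword as a
# whole word. quotemeta escapes characters that are special in a regex.
sub build_exclude_regex {
    my @keywords = @_;
    my $alternation = join '|', map { quotemeta($_) } @keywords;
    return qr/\b(?:$alternation)\b/;
}

# Hypothetical helper: keep the rows whose second (tab-separated)
# column does not contain any keyword.
sub filter_rows {
    my ( $keywords, $rows ) = @_;
    my $re = build_exclude_regex(@$keywords);
    return grep {
        my ( undef, $group ) = split /\t/, $_, 2;
        !( defined $group && $group =~ $re );
    } @$rows;
}

my @keywords = qw(broken vector unidentified synthetic artificial);
my @rows     = (
    "org1\tFish",
    "org2\tAmphibian",
    "org3\tvector",
    "org4\tsynthetic species",
    "org5\tMammal",
    "org6\tsynthetic-species",    # made-up row for illustration
);
print "$_\n" for filter_rows( \@keywords, \@rows );
```

With the word boundaries, a keyword matches anywhere in the second column as long as it stands as a separate word, so both "synthetic species" and "synthetic-species" are removed, while a longer word that merely contains the keyword is kept.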
