Hi everyone, I have a file such as:

scaffold_1_1    X   2   2
scaffold_24_0   X   9   2
scaffold_15 X   2   2
IDBA_scaffold_30_1  X   2   317
scf7180005161000_2  X   1   2

The idea is simply to remove the last number part of the names in the first column, but the scaffold names come in 4 forms:

scaffold_number0_number1
scaffold_number0
IDBA_scaffold_number0_number1
scfXXX_number1

The goal is to remove every trailing _number1. Here is the result I should get for this example:

scaffold_1  X   2   2
scaffold_24 X   9   2
scaffold_15 X   2   2
IDBA_scaffold_30    X   2   317
scf7180005161000    X   1   2

Do you have an idea how to deal with that?

Thank you for your help.

Inian
bean

4 Answers

1st solution: Could you please try the following? (This only helps if you simply want to remove the last _ and the digits after it from the 1st field.)

awk '{sub(/_[0-9]+$/,"",$1)} 1'  Input_file
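
To see what this unconditional substitution does, it can be run on two of the sample names. The sketch below assumes tab-separated input and wraps the command in a hypothetical `strip_always` function; `FS = OFS = "\t"` is an addition that is not in the one-liner above, included so the tabs survive when awk rebuilds the line.

```shell
# Hypothetical helper (not part of the answer): remove a trailing _<digits>
# from the first tab-separated field of every input line.
strip_always() {
  awk 'BEGIN { FS = OFS = "\t" } { sub(/_[0-9]+$/, "", $1) } 1'
}

# Two sample names: one with a _number1 suffix, one without.
printf 'scaffold_1_1\tX\t2\t2\nscaffold_15\tX\t2\t2\n' | strip_always
```

The first name becomes `scaffold_1` as wanted, but the second becomes `scaffold`, because `_15` is also a trailing `_` followed by digits. A condition is needed before substituting if names of the form scaffold_number0 must stay unchanged.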

2nd solution:

If you want to strip the suffix only when a line containing scaffold has more than 2 _-separated parts in its 1st field (or when the line contains scf), then try the following.

awk '(/scaffold/ && split($1,a,"_")>2) || /scf/{sub(/_[0-9]+$/,"",$1)} 1' Input_file

Output will be as follows.

scaffold_1 X 2 2
scaffold_24 X 9 2
scaffold_15 X   2   2
IDBA_scaffold_30 X 2 317
scf7180005161000 X 1 2
RavinderSingh13

You can try Perl:

perl -pe ' s/(^\S+)_\d\b/$1/g ' 

with your inputs

$ cat bean.txt
scaffold_1_1    X   2   2
scaffold_24_0   X   9   2
scaffold_15 X   2   2
IDBA_scaffold_30_1  X   2   317
scf7180005161000_2  X   1   2
$ perl -pe ' s/(^\S+)_\d\b/$1/g ' bean.txt
scaffold_1    X   2   2
scaffold_24   X   9   2
scaffold_15 X   2   2
IDBA_scaffold_30  X   2   317
scf7180005161000  X   1   2
$

Thanks @anubhava for catching one of the edge cases and helping to fix it.

$ cat bean2.txt
scaffold_1_1    X   2   2
scaffold_24_0   X   9   2
scaffold_15 X   2   2
IDBA_scaffold_30_1  X   2   317
scaffold_1_15     X   2   2  # => this was not fixed in first answer
$ perl -pe 's/^(?!scaffold_\d+\b)(\S+)_\d+\b/$1/g' bean2.txt
scaffold_1    X   2   2
scaffold_24   X   9   2
scaffold_15 X   2   2
IDBA_scaffold_30  X   2   317
scaffold_1     X   2   2
$
stack0114106
  • @anubhava sir, you are right. I just tried `perl -pe ' s/(^.+?)(_\d+)?_\d+\b/$2?"$1$2":$&/ge '`, please review. If it is OK, I'll add it to the answer. – stack0114106 Feb 12 '19 at 15:24
  • That may work for OP's data but new regex might be error prone. I think it would be better to use negative lookahead e.g. `perl -pe 's/^(?!scaffold_\d+\b)(\S+)_\d+\b/$1/g' file` – anubhava Feb 12 '19 at 15:30
  • @anubhava sir, I need your help in reviewing my regex solution to https://stackoverflow.com/questions/54972535/delete-duplicate-lines-only-if-they-match-a-pattern/54985530#54985530 please review when you get time – stack0114106 Mar 04 '19 at 14:51

Here is another awk variant:

awk 'BEGIN{FS=OFS="\t"} $1 ~ /^scf[0-9]+_[0-9]+$/ || split($1, a, "_") > 2 {
sub(/_[0-9]+$/, "", $1) } 1' file

scaffold_1  X   2   2
scaffold_24 X   9   2
scaffold_15 X   2   2
IDBA_scaffold_30    X   2   317
scf7180005161000    X   1   2
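
One way to check a candidate command against the expected result from the question is to build both data sets with `printf` and compare them. This is a minimal sketch, assuming tab-separated input; the function name `strip_suffix` and the variable names are made up for the example.

```shell
# Hypothetical wrapper around the awk variant above; reads stdin, writes stdout.
strip_suffix() {
  awk 'BEGIN { FS = OFS = "\t" }
       # Strip only scfNNN_N names or names with more than two "_"-separated parts.
       $1 ~ /^scf[0-9]+_[0-9]+$/ || split($1, a, "_") > 2 {
         sub(/_[0-9]+$/, "", $1)
       }
       1'
}

# Sample input from the question, with real tabs (printf reuses the format
# string for each group of three arguments).
sample=$(printf '%s\tX\t%s\t%s\n' \
  scaffold_1_1 2 2 \
  scaffold_24_0 9 2 \
  scaffold_15 2 2 \
  IDBA_scaffold_30_1 2 317 \
  scf7180005161000_2 1 2)

# Expected output from the question.
expected=$(printf '%s\tX\t%s\t%s\n' \
  scaffold_1 2 2 \
  scaffold_24 9 2 \
  scaffold_15 2 2 \
  IDBA_scaffold_30 2 317 \
  scf7180005161000 1 2)

result=$(printf '%s\n' "$sample" | strip_suffix)
[ "$result" = "$expected" ] && echo "all names match"
```

Swapping a different command into `strip_suffix` makes it easy to test other candidates on the same data.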
anubhava
  • @RavinderSingh13: Thanks, ++ for your solution. I think OP wants `scf7180005161000` in output instead of `scf7180005161000_2` – anubhava Feb 12 '19 at 15:00

Using any sed that supports -E for EREs, e.g. GNU or OSX/BSD seds:

$ sed -E 's/((_|scf)[0-9]+)_[0-9]+/\1/' file
scaffold_1    X   2   2
scaffold_24   X   9   2
scaffold_15 X   2   2
IDBA_scaffold_30  X   2   317
scf7180005161000  X   1   2
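
The substitution above is not anchored, so it changes the first place on the line where the pattern fits. That is fine for the sample data, where only the first column has that shape. If a later column could ever look like a scaffold suffix, a possible variant is to anchor the match at the start of the line. This is only a sketch under that assumption: the function name `strip_first_field` and the second-column value `x_1_2` are made up for the example.

```shell
# Hypothetical variant (not from the answer): anchor the match at the start of
# the line so only the first whitespace-delimited field can be changed.
strip_first_field() {
  sed -E 's/^([^[:space:]]*(_|scf)[0-9]+)_[0-9]+([[:space:]]|$)/\1\3/'
}

# The second line has a made-up second column that looks like a scaffold suffix.
printf 'scaffold_1_1\tX\t2\t2\nscaffold_15\tx_1_2\t2\t2\n' | strip_first_field
```

Here `scaffold_1_1` is shortened to `scaffold_1`, while the `scaffold_15` line is left untouched, including its `x_1_2` column.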
Ed Morton