
I would like to find a particular pattern ("k__") and any characters after it, up to a space, and then move that captured text to the end of the line.

With this example file:

cat test.file
37099   k__Eukaryota species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
43925   k__Eukaryota species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__
43925   k__Eukaryota species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__
43925   k__Eukaryota species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__
43925   k__Bacteria species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__

So, I'd like to match "k__Eukaryota" and "k__Bacteria" (and any other token that starts with k__) and then move the captured match to the end of the line. Desired output:

37099    species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
43925    species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Eukaryota
43925    species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Eukaryota
43925    species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Eukaryota
43925    species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Bacteria

I thought it would be easy, but I can't get it to work. Here is what I've tried:

cat test.file | gsed -E 's#(.*k__)(k__\w\+)(.*)#\1\3\2#'

The idea: capture the text up to the pattern, then capture the pattern plus any word characters up to whitespace, then capture the rest of the line, and finally change the order of the capture groups.

I think I can use backreferences to reorder these groups, but I'm probably not matching them correctly. How do I capture everything up to my pattern, then the pattern itself ("k__xyz"), then the rest of the line, and rearrange those groups? Is this the right approach?

Any help is much appreciated!

LP

oguz ismail
LP_640

2 Answers

1

To edit the original file in place, add the -i option:
sed -i -r 's/(.*)(k__[^ ]*)( .*)/\1\3 \2/g' test.file
To save the result to another file instead, omit -i and redirect the output:
sed -r 's/(.*)(k__[^ ]*)( .*)/\1\3 \2/g' test.file > new.file

My test result:

szvp000006656:/home # cat test.file
37099   k__Eukaryota species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista

szvp000006656:/home # sed -r 's/(.*)(k__[^ ]*)( .*)/\1\3 \2/g' test.file > new.file
szvp000006656:/home # cat new.file
37099    species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista k__Eukaryota

szvp000006656:/home # sed -i -r 's/(.*)(k__[^ ]*)( .*)/\1\3 \2/g' test.file
szvp000006656:/home # cat test.file
37099    species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista k__Eukaryota
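To see why this substitution works where the attempt in the question does not, it may help to run both on a single line. This is only a sketch: it assumes a sed that accepts -E (GNU sed and BSD sed both do), uses a shortened sample line, and the variable names are made up for illustration. The question's pattern `(.*k__)(k__\w\+)` asks for `k__` twice in a row, and with -E the `\+` is a literal plus sign rather than a quantifier, so nothing matches and the line comes back unchanged.

```shell
# Shortened sample line (three spaces after the ID), for illustration only.
line='37099   k__Eukaryota species:s__Isochrysis galbana;phylum:p__Haptista'

# Pattern from the question: requires "k__k__" and a literal "+", so no match.
attempt=$(printf '%s\n' "$line" | sed -E 's#(.*k__)(k__\w\+)(.*)#\1\3\2#')

# Pattern from this answer: \1 = prefix, \2 = k__ token, \3 = space + rest.
fixed=$(printf '%s\n' "$line" | sed -E 's/(.*)(k__[^ ]*)( .*)/\1\3 \2/')

printf 'attempt: %s\n' "$attempt"
printf 'fixed:   %s\n' "$fixed"
```

The first result is identical to the input; the second has k__Eukaryota moved to the end, with one extra space left where the token used to be.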

Note:

  1. A tool such as https://regexr.com/ is handy for debugging regular expression syntax.
  2. Neither basic nor extended POSIX/GNU regex recognizes the non-greedy quantifier .*?; you need a later regex flavor for that. In sed, use a negated character class such as [^/]* instead of .*? (chaos, Stack Overflow).
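The second note can be illustrated with a small comparison. This is a sketch under the same assumption (a sed that accepts -E); the sample line and variable names are invented. It brackets whatever the capture group matched, so the difference between a greedy `.*` and a negated class `[^ ]*` is visible.

```shell
# Invented sample line with several spaces after the k__ token.
line='1 k__Bacteria species:s__X y;phylum:p__'

# Greedy: .* runs on to the LAST space in the line.
greedy=$(printf '%s\n' "$line" | sed -E 's/(k__.*) /[\1] /')

# Negated class: [^ ]* cannot cross a space, so it stops at the first one.
bounded=$(printf '%s\n' "$line" | sed -E 's/(k__[^ ]*) /[\1] /')

printf '%s\n' "$greedy"
printf '%s\n' "$bounded"
```

The greedy version brackets "k__Bacteria species:s__X", while the negated class brackets only "k__Bacteria", which is the behaviour a non-greedy quantifier would give in a regex flavor that supports it.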
  • You don't need the 1st capture group. `sed -E 's/(k__[^ ]+) (.*)/\2 \1/'` works to simply edit the part of the line beginning with the match. – stevesliva Oct 16 '21 at 23:34
0

Use this Perl one-liner:

perl -lpe 's{^(.*?\s)(k__\S+)\s+(.*)}{$1$3 $2}' test.file > out.file

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.

^ : Beginning of the line.
(.*?\s) : 0 or more of any characters (non-greedy), ending with whitespace, capture and store in variable $1.
(k__\S+) : Literal k__ followed by 1 or more non-whitespace characters, capture and store in variable $2.
\s+(.*) : 1 or more whitespace characters. Then 0 or more any characters, capture and store in variable $3.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)

Timur Shtatland
  • `perl -pe 's/k__\S+//; $k=$&; s/$/ $k/' ` -- have perl remove the string, which is saved as `$&`, then stash that in `$k`, then replace the line end with `$k` ... takes advantage of variables, which perl has but sed doesn't (other than the hold space). – stevesliva Oct 16 '21 at 23:41
  • Oh, or this: `perl -lpe 's/k__\S+//; $_="$_ $&"'` ... anyways, yay perl. – stevesliva Oct 17 '21 at 00:05