I'm manipulating lines from a .vcf file in which bread is numbered 1 through 20 in Roman numerals. I only want the lines corresponding to bread 10, so I've used

awk '/breadX/ {print}' file.vcf > Test.txt

to output the lines containing "breadX" to Test.txt. That works, but the output also includes "breadXI" through "breadXX". Is there a way to exclude those, given that "breadX" is out of order and towards the middle of the file (XIV...X...XX), and that each line contains more information after the label? I only want lines that start with bread 10, and none of the others. Any help would be appreciated.

Jotne
Omlethead
  • The Roman numeral part is not that relevant - you need a more complex regex, but that depends on what the data looks like. Can you provide a representative sample? – Ian McGowan Sep 19 '19 at 04:12
  • Related: https://stackoverflow.com/questions/267399 – kvantour Sep 19 '19 at 07:59

2 Answers

Since there is no definitive data sample showing what might follow breadX, just exclude every string in which a Roman numeral symbol (I, V, X, L, C, D, M) follows it:

$ awk '/^breadX([^IVXLCDM]|$)/' file

Sample test file:

$ cat file
breadX
breadXI
breadX2
3

Test it:

$ awk '/^breadX([^IVXLCDM]|$)/' file

Output:

breadX
breadX2
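
If the label is always the first whitespace-separated field of the line, an exact string comparison on that field may be simpler than a regex. This is only a sketch under that assumption, since the question shows no sample data; the file name sample.txt and its contents are made up for illustration:

```shell
# Hypothetical sample data: label in the first field, extra columns after it.
printf 'breadXIV\t100\tA\nbreadX\t200\tC\nbreadXX\t300\tG\n' > sample.txt

# Keep only the lines whose first field is exactly the string "breadX".
result=$(awk '$1 == "breadX"' sample.txt)
echo "$result"
```

Because the comparison is against the whole field, breadXIV and breadXX are skipped regardless of what follows the label on the line.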
James Brown

If breadX is a whole word, you can use word boundaries to limit the match.

cat file
test breadXI more
hi breadX yes
cat home breadXX 

awk '/\<breadX\>/' file
hi breadX yes
  • \< start of word
  • \> end of word
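
In an awk that lacks \< and \> (they are GNU awk extensions), a rough substitute is to compare each field against the string. This sketch assumes the words are separated by whitespace, so it would not match something like "breadX," with trailing punctuation; words.txt is a made-up file name:

```shell
# Hypothetical sample data mirroring the example above.
printf 'test breadXI more\nhi breadX yes\ncat home breadXX\n' > words.txt

# Print a line as soon as any of its fields is exactly "breadX".
matches=$(awk '{ for (i = 1; i <= NF; i++) if ($i == "breadX") { print; next } }' words.txt)
echo "$matches"
```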

PS: You do not need the print, since printing the line is the default action when the pattern matches.
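
A quick way to check the point about the default action is to run a pattern with and without an explicit print and compare the results. The file demo.txt and its contents are made up for this sketch:

```shell
# Hypothetical two-line sample file.
printf 'breadX\nbreadXI\n' > demo.txt

# Same pattern, once with an explicit action and once relying on the default.
with_print=$(awk '/^breadX$/ {print}' demo.txt)
without_print=$(awk '/^breadX$/' demo.txt)

[ "$with_print" = "$without_print" ] && echo "identical output"
```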

Jotne