How can I match and delete all comments on a line? Using sed, I can delete comments that start at the beginning of a line, or those not inside quotes. But my script fails on examples like the following:

This one "# this is not a comment" # but this "is a comment"

Can sed handle this case? If so, what is the regex?

Example:

  • Input:

    This one "# this is not a comment" # but this "is a comment" 
    
  • Output:

    This one "# this is not a comment"
    
Benjamin W.
Inventor
  • post an example along with the expected output. – Avinash Raj Oct 26 '14 at 17:56
  • A comment is a word (obeying bash's convoluted quoting rules) which starts with an unquoted #. That's very tricky to recognize with a regular expression, because determining the range of a quoted expression can only be done with an understanding of bash syntax. And the syntax is recursive: `cmd "$(other_cmd #comment` is legal (though not complete). – rici Oct 26 '14 at 18:03
  • I know it is tricky. I spent a lot of time trying to figure it out... Does that mean there is no simple solution using sed? – Inventor Oct 26 '14 at 18:05
  • There are no simple solutions using `sed`. Not even complicated ones. Perhaps there are insanely complicated ones, though, but nobody even bothered to think about it. – gniourf_gniourf Oct 26 '14 at 18:09
  • Thank you very much. Maybe you could recommend some other tools I can use in bash? – Inventor Oct 26 '14 at 18:10
  • @user2590816: Can you tell us why you want to strip those comments? Are you parsing a configuration file? – firegurafiku Oct 26 '14 at 18:38
  • possible duplicate of [how to remove comments from a bash script](http://stackoverflow.com/questions/25291228/how-to-remove-comments-from-a-bash-script) – rici Oct 26 '14 at 19:38
  • @user2590816: The best tool is often bash itself :) See my answer to the question I marked as duplicate. – rici Oct 26 '14 at 19:40
  • Many people in this forum are trying to use `sed`, `grep`, `awk` as an interpreter, compiler, or parser of XML, CSV or some other format. They are just not made for this task! They can work in some cases, but almost every time you can find slightly more complex input on which the solution fails. – Vytenis Bivainis Oct 26 '14 at 21:13
  • @Vytenis Can you suggest some tools that are made for this task? – Inventor Oct 26 '14 at 21:15
  • javacc/jjtree probably, but it's a steep curve to learn. – Vytenis Bivainis Oct 26 '14 at 21:39

2 Answers

You can use a lexical analyzer such as Flex, applied directly to the script. Its manual has a section titled "How can I match C-style comments?", and I think you can adapt that part to your problem.

If you need an in-depth tutorial, you can find it here; under the "Lexical Analysis" section there is a PDF that introduces the tool and an archive with some practical examples, including "c99-comment-eater", which you can draw inspiration from.

rici
Alberto Coletta
  • Thanks for the information. I ended up writing C-style code which traces each character on the line and stops as soon as it sees '#' outside double quotes. I don't think it is an efficient solution, but at least it works. – Inventor Oct 26 '14 at 20:01
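
The character-by-character approach described in the comment above might look roughly like the following in plain shell. This is only a sketch, not the asker's actual code: the function name `strip_comment` is hypothetical, and it assumes that only double quotes protect a `#` (single quotes, backslash escapes and `$(...)` are deliberately ignored). It also trims the whitespace left in front of the removed comment.

```shell
# Hypothetical helper (not from the thread): print a line without its
# trailing comment. Assumption: only double quotes protect '#'.
strip_comment() (
    rest=$1
    out=''
    in_quotes=0
    while [ -n "$rest" ]; do
        ch=${rest%"${rest#?}"}    # first character of what is left
        rest=${rest#?}            # drop that character from the input
        case $ch in
            '"') in_quotes=$((1 - in_quotes)) ;;    # toggle quoted state
            '#') if [ "$in_quotes" -eq 0 ]; then
                     break    # first '#' outside quotes starts the comment
                 fi ;;
        esac
        out=$out$ch
    done
    # Remove the whitespace that preceded the comment.
    out=${out%"${out##*[![:space:]]}"}
    printf '%s\n' "$out"
)

strip_comment 'This one "# this is not a comment" # but this "is a comment"'
```

On the example from the question this prints `This one "# this is not a comment"`. The function body runs in a subshell (parentheses instead of braces) so its variables do not leak into the caller.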

If we assume that # is not a comment when it is in quotes or escaped with backslash, then we can define the following regex:

(ES|RT|QT)*C?

where

ES - escape sequence: a backslash followed by any one character

\\.

RT - non-special regular text

[^"\\#]*

QT - text in quotes

"[^"]*"

C - comment starting with unescaped, unquoted hash sign # and ending with the end of line

#.*

A possible solution using sed (note that the `\|` alternation is a GNU extension to basic regular expressions):

sed 's/^\(\(\\.\|[^"\\#]*\|"[^"]*"\)*\)#.*$/\1/'
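
As a quick check, the command can be run on the line from the question and on a line with an escaped hash. This sketch assumes GNU sed; the wrapper name `strip_comment_sed` is hypothetical, and the second `-e` expression is an addition that trims the whitespace left in front of the removed comment.

```shell
# Hypothetical wrapper around the sed command above (assumes GNU sed,
# because of the \| alternation).
strip_comment_sed() {
    # 1st expression: the command from the answer, unchanged.
    # 2nd expression (an addition): trim trailing whitespace.
    sed -e 's/^\(\(\\.\|[^"\\#]*\|"[^"]*"\)*\)#.*$/\1/' \
        -e 's/[[:space:]]*$//'
}

# Quoted hash is kept, unquoted hash starts the comment.
printf '%s\n' 'This one "# this is not a comment" # but this "is a comment"' | strip_comment_sed

# Escaped hash is kept as well.
printf '%s\n' 'echo a \# b   # real comment' | strip_comment_sed
```

The first call prints `This one "# this is not a comment"` and the second prints `echo a \# b`. Lines containing an unterminated double quote before the hash are left unchanged, because the QT pattern cannot match them.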
QWERTY21KG