
I have a command like this, which marks words so that they appear in the document's index:

sed -i "s/\b$line\b/\\\keywordis\{$line\}\{$wordis\}\{$definitionis\}/g" file.txt

The problem is that it finds matches inside existing matches. For example, "hello" is replaced with \keywordis{hello}{a common greeting}, but then "greeting" may be searched for too, which produces \keywordis{hello}{a common \keywordis{greeting}{a phrase used when meeting someone}}...

How can I tell sed to perform the replacement, but ignore text that is already inside curly brackets?

  • Curly brackets in this case will always appear on the same line.
Village
  • Why `sed`? Why not use an actual programming language? `but then "greeting" might be searched too, an` Creating a state machine in sed is __extremely__ hard. It is "possible", but doing it in `sed` is just pointless, except for academic purposes. Write a real parser in Perl or Python. – KamilCuk Dec 20 '21 at 17:06
  • What are the contents of `$line`? You seem to be asking an XY question: you ask about sed. Don't you want to ask how to apply specific formatting to your LaTeX documents? – KamilCuk Dec 20 '21 at 17:17

1 Answer


How can I tell sed to perform the replacement, but ignore text that is already inside curly brackets?

First, tokenize the input. Place something unique, like | or the byte \x01, on both sides of every existing \keywordis{hello}{a common greeting}, and store the result in hold space. Something along the lines of s/\\the regex to match{hello}{a common greeting}/\x01&\x01/g.
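A minimal sketch of that tokenizing step, under a few assumptions that are not in the original: the function name mark_existing is made up for illustration, | stands in for the unique byte so the result is visible, and an existing entry is taken to be \keywordis followed by any number of {...} groups without nested braces.

```shell
# Hypothetical helper: wrap every existing \keywordis{...}{...} in "|".
# "|" is only for illustration; in real input pick a byte that cannot
# occur in the text (the answer suggests \x01).
mark_existing() {
    # \\keywordis       a literal backslash followed by "keywordis"
    # \({[^{}]*}\)*     any number of {...} groups without nested braces
    # &                 the whole match, re-emitted between the markers
    sed 's/\\keywordis\({[^{}]*}\)*/|&|/g'
}

printf '%s\n' '\keywordis{hello}{a common greeting} and hello' | mark_existing
```

After this step, anything between two markers is known to be an existing entry and can be skipped when searching for words.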

Then iterate over the elements in hold space. Use \n to separate the elements already parsed from those not yet parsed (output from input). If an element matches the format \keywordis{hello}{a common greeting}, just move it to the front, before the newline in hold space; if it does not, perform the replacement. Here's an example: "Identify and replace selective space inside given text file", which uses a double newline \n\n as the input/output separator.

Because, as you noted, replacement text can itself contain the words you are searching for, I believe the simplest approach is, after each replacement, to shuffle the pattern space as if it were ready for output and then start the process all over again for the current line.

Then, at the end, shuffle the hold space to remove the \x01 bytes, the newline and any leftovers, and output it.

Overall, it's LaTeX. I believe it would be simpler to do it manually.


By "eating" the string from the back and placing it in front of the input/output separator inside pattern space, I simplified the process. The following program:

sed '
    # add our input/output separator - just a newline
    s/^/\n/

    : loop
    # l1000
    # Ignore any "\keywords" and "{stuff}"
    /^\([^\n]*\)\n\(.*\)\(\\[^{}]*\|{[^{}]*}\)$/{
        s//\3\1\n\2/
        b loop
    }
    # Replace a hello that now sits at the end of the unprocessed
    # part (optionally followed only by stray brace characters).
    # .* is greedy, so it eats everything up to that last hello.
    /^\([^\n]*\)\n\(.*\)hello\([{}]*\)$/{
        s//\\keywordis{hello}{a common greeting}\3\1\n\2/
        b loop
    }
    # Hello was not matched - move one trailing non-brace character
    # to the output side. It has to match at least one character
    # after the newline, otherwise the loop would never end.
    /^\([^\n]*\)\n\(.*\)\([^{}]\+\)$/{
        s//\3\1\n\2/
        b loop
    }

    s/\n//
' <<<'
\keywordis{hello}{hello} hello {some other hello} another hello yet
'

outputs:

\keywordis{hello}{hello} \keywordis{hello}{a common greeting} {some other hello} another \keywordis{hello}{a common greeting} yet
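To connect this back to the variables in the question, the same loop can be wrapped in a shell function that takes the word and its definition as arguments. This is only a sketch under stated assumptions: mark_word is a hypothetical name, GNU sed is required (for \n, \| and \+), and both arguments must be free of characters that are special to sed (/, &, \, braces, newlines). Unlike the \b in the question, it does not check word boundaries, so "hello" inside "othello" would also be marked.

```shell
# Hypothetical wrapper around the sed loop above.
# Usage: mark_word WORD DEFINITION < input
# Assumes GNU sed, and that WORD and DEFINITION contain no characters
# special to sed (/, &, \, braces, newlines).
mark_word() {
    word=$1
    definition=$2
    sed '
        # input/output separator: output goes before it, input after
        s/^/\n/

        : loop
        # skip a trailing \command or {group}
        /^\([^\n]*\)\n\(.*\)\(\\[^{}]*\|{[^{}]*}\)$/{
            s//\3\1\n\2/
            b loop
        }
        # the word sits at the end of the unprocessed part: mark it
        /^\([^\n]*\)\n\(.*\)'"$word"'\([{}]*\)$/{
            s//\\keywordis{'"$word"'}{'"$definition"'}\3\1\n\2/
            b loop
        }
        # otherwise move one trailing non-brace character to the output
        /^\([^\n]*\)\n\(.*\)\([^{}]\+\)$/{
            s//\3\1\n\2/
            b loop
        }

        s/\n//
    '
}

printf '%s\n' 'say hello {hello} there' | mark_word hello 'a greeting'
```

Running it once per word on text that already contains entries leaves the existing {...} groups untouched, which is the behaviour the question asks for.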
KamilCuk