2

I have the following text:

string = "<i>R</i> subspace  <i>{V.</i> generated by <i>{v<sub>1</sub>,...,v<sub>i</sub></i>, "

A careful reader might notice that there are two brackets missing. I was wondering, how this could be fixed using Python?

The expected output would be:

the <i>R</i>  subspace  <i>{V.}</i> generated by <i>{v<sub>1</sub>,...,v<sub>i</sub>}</i>,

One could:

  1. Check: Is there a bracket after <i> ?
  2. If yes -> Is there a bracket before </i> ?

How can I code this?

Edit

I have found this code, that can tell you if the brackets match or not.

halfer
  • 19,824
  • 17
  • 99
  • 186
henry
  • 875
  • 1
  • 18
  • 48
  • What goes inside the brackets? Can "" appear inside brackets (i.e. does "{v" turn into "{v}" or "{v}")? – manveti Apr 11 '19 at 21:16
  • @CraigMeier Good point. The second case would be the correct one. – henry Apr 11 '19 at 21:19
  • @CraigMeier Would you know how to already code the simpler case (first case) ? – henry Apr 11 '19 at 21:27
  • The first case can be handled by a regex (e.g. `re.sub` with a callable `repl` param that ensures there's a final '}' for any chunk with an initial '{'). But the second case falls under https://stackoverflow.com/a/1732454/ – manveti Apr 11 '19 at 21:34
  • @CraigMeier Thanks for your input, but I don't know how to code it with `repl`. Sorry to bother you... I see that I can do: re.sub(r', SOMETHING, string) – henry Apr 11 '19 at 21:45
  • Another great question. However, as with your previous HTML repair question, is there any significance to the `i` in ``, or should it work on arbitrary tags? – ggorlen Apr 11 '19 at 22:02
  • how many times this function has to be called? What generated this string? why cant you fix it manually – aaaaa says reinstate Monica Apr 11 '19 at 22:02
  • @ggorlen Thanks for your question: First if would be good to have it working for the `i `but of course, an extinction to arbitrary tags would be very useful for a much wider community ! – henry Apr 11 '19 at 22:38

1 Answers1

2

How about the following regex solution:

import re

string = "<i>R</i> subspace  <i>{V.</i> generated by <i>{v<sub>1</sub>,...,v<sub>i</sub></i>, "
expected = "<i>R</i> subspace  <i>{V.}</i> generated by <i>{v<sub>1</sub>,...,v<sub>i</sub>}</i>, "

fixed = re.sub(r"<(?P<tag>.*?)>({.*?)</(?P=tag)>", r"<\1>\2}</\1>", string)

print(fixed == expected) # True

The idea is to match a tag followed by a brace, find its closing tag, and plop the companion brace before the closing tag using the capture groups as <\1>\2}</\1>. Breakdown:

< # literal opening bracket
 (?P<tag> # open a named capture group
         .*? # lazily match any characters
            ) # end named capture group
             > # literal closing bracket
              ( # open capture group 2
               { # literal opening brace
                .*? # lazily match any characters
                   ) # end capture group 2
                    < # literal opening bracket
                     / # literal slash
                      (?P=tag) # backreference to the named group
                              > # literal closing bracket

If you just want <i>, you can use re.sub(r"<i>({.*?)</i>", r"<i>\1}</i>", string).

ggorlen
  • 44,755
  • 7
  • 76
  • 106