I have a weird list of items and nested lists that uses | as a delimiter and [[ ]] as brackets. It looks like this:

| item1 | item2 | item3 | Ulist1[[ | item4 | item5 | Ulist2[[ | item6 | item7 ]] | item8 ]] | item9 | list3[[ | item10 | item11 | item12 ]] | item13 | item14

I want to match the items in lists called Ulist* (items 4-8) using a regex and replace them with Uitem*. The result should look like this:

| item1 | item2 | item3 | Ulist1[[ | Uitem4 | Uitem5 | Ulist2[[ | Uitem6 | Uitem7 ]] | Uitem8 ]] | item9 | list3[[ | item10 | item11 | item12 ]] | item13 | item14

I tried almost everything I know about regex, but I haven't found a pattern that matches each item inside the Ulists. My current regex:

/Ulist(\d+)\[\[(\s*(\|\s*[^\s\|]*)*\s*)*\]\]/i

What is wrong? I am a beginner with regex.

This is Python 2.7. Specifically, my code is:

    def fixDirtyLists(self, text):
        text = textlib.replaceExcept(text, r'Ulist(\d+)\[\[(\s*(\|\s*[^\s\|]*)*\s*)*\]\]', r'Ulist\1[[ U\3 ]]', '', site=self.site)
        return text

text holds that weird list, and textlib.replaceExcept replaces each regex match with the replacement pattern. Not complicated at all.

aleskva

1 Answer

If you install the PyPI regex module (with Python 2.7.9+, running pip install regex from the \Python27\Scripts\ folder is enough), you will be able to match nested square brackets. You can then match the substrings you need and replace item with Uitem only inside those substrings.

The pattern (note that recursion in the PyPI regex module resembles that of PCRE):

(Ulist\d+)(\[\[(?>[^][]|](?!])|\[(?!\[)|(?2))*]])
^-Group1-^^-----------Group2--------------------^

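A short explanation: (Ulist\d+) is Group 1, which matches the literal word Ulist followed by one or more digits. It is followed by Group 2, (\[\[(?>[^][]|](?!])|\[(?!\[)|(?2))*]]), which matches a substring from [[ up to the corresponding ]]. Inside the atomic group, [^][] matches any character that is not a bracket, ](?!]) matches a single ] not followed by another ], \[(?!\[) matches a single [ not followed by another [, and (?2) recurses into Group 2 to match a nested [[ ... ]] pair.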

And the Python code:

>>> import regex
>>> s = "| item1 | item2 | item3 | Ulist1[[ | item4 | item5 | Ulist2[[ | item6 | item7 ]] | item8 ]] | item9 | list3[[ | item10 | item11 | item12 ]] | item13 | item14"
>>> pat = r'(Ulist\d+)(\[\[(?>[^][]|](?!])|\[(?!\[)|(?2))*]])'
>>> res = regex.sub(pat, lambda m: m.group(1) + m.group(2).replace("item", "Uitem"), s)
>>> print(res)
| item1 | item2 | item3 | Ulist1[[ | Uitem4 | Uitem5 | Ulist2[[ | Uitem6 | Uitem7 ]] | Uitem8 ]] | item9 | list3[[ | item10 | item11 | item12 ]] | item13 | item14

To avoid modifying plain lists nested inside a Ulist, use

def repl(m):
    # The capturing group makes split() keep the inner "list...[[ ... ]]" pieces
    parts = regex.split(r'(\blist\d*\[{2}[^\]]*(?:](?!])[^\]]*)*]])', m.group(0))
    return "".join([x if x.startswith("list") else x.replace("item", "Uitem") for x in parts])

and replace the regex.sub with

res = regex.sub(pat, repl, s)
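
If item names are more complicated than a fixed substring, the per-piece `str.replace` in `repl` can be swapped for a regex substitution. The sketch below shows only that piece-wise step, applied to a chunk that the balanced pattern has already matched. It uses the standard-library `re` module, which should be enough here because the inner-list pattern needs no recursion. The helper name `rename_outside_plain_lists`, the `ITEM` pattern, and the sample chunk are assumptions made for illustration.

```python
import re

# A plain (non-U) list without nested brackets, e.g. "list4[[ ... ]]".
# The outer capturing group makes re.split() keep these pieces in its result.
PLAIN_LIST = re.compile(r'(\blist\d*\[{2}[^\]]*(?:](?!])[^\]]*)*]])')

# Hypothetical item-name pattern; adjust it to the real item syntax.
ITEM = re.compile(r'\bitem(\d+)')


def rename_outside_plain_lists(chunk):
    """Rename itemN to UitemN everywhere except inside plain lists."""
    pieces = PLAIN_LIST.split(chunk)
    # With one capturing group, odd indices hold the captured plain lists.
    return "".join(
        piece if index % 2 else ITEM.sub(r'Uitem\1', piece)
        for index, piece in enumerate(pieces)
    )


chunk = "Ulist1[[ | item4 | list4[[ | item15 ]] | item18 ]]"
result = rename_outside_plain_lists(chunk)
```

Inside a callback, the same function could be applied to `m.group(0)` in place of the list comprehension.
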
Wiktor Stribiżew
  • Thank you, I'll try it and report whether it works for me, but it looks like a great solution – aleskva Jan 20 '16 at 02:33
  • There is a problem when there is a `list` inside a `Ulist`. I don't want to change the items of that list to Uitems, because it is not a Ulist. Do you know how to solve this? – aleskva Jan 21 '16 at 08:22
  • It should be easier to process in a callback method. I need some time to check. – Wiktor Stribiżew Jan 21 '16 at 08:27
  • And could the `x.replace()` or `m.group(2).replace()` calls use a regex too (in case of much more complicated item names)? – aleskva Jan 21 '16 at 09:19
  • You can use **`regex.sub(regex_pattern, replacement, input)`**. – Wiktor Stribiżew Jan 21 '16 at 09:21
  • So e.g. `x.replace()` changes to `x.regex.sub()`? And the final code with updated answer would look like [this](http://pastebin.com/6Z8kDjq0)? – aleskva Jan 21 '16 at 09:30
  • `x.replace(old, new)` only performs plain string replacement. When using `re`/`regex`, you need `(re or regex).sub(regex_searching_for_old, pattern_replacement_with_new, x)`. I think the code at pastebin should work. – Wiktor Stribiżew Jan 21 '16 at 09:36
  • It works for `| item1 | item2 | item3 | Ulist1[[ | item4 | item5 | Ulist2[[ | item6 | item7 ]] | item8 ]] | item9 | list3[[ | item10 | item11 | item12 ]] | item13 | item14`, but if I modify it (to add a list inside of an Ulist), the script fails with error `Neoprávněný přístup do paměti (SIGSEGV) (core dumped [obraz paměti uložen])` (translation: unauthorized access to memory (SIGSEGV), core dumped, memory image saved) – aleskva Jan 21 '16 at 10:29
  • Modified list: `| item1 | item2 | item3 | Ulist1[[ | item4 | item5 | Ulist2[[ | item6 | item7 ]] | item8 | list4[[ | item15 | item16 | item17 ]] | item18 ]] | item9 | list3[[ | item10 | item11 | item12 ]] | item13 | item14` – aleskva Jan 21 '16 at 10:31
  • I am afraid that it can be a bug in the regex module. – Wiktor Stribiżew Jan 21 '16 at 10:56
  • Does your machine produce the same error? Could you report it, if you know where? I'll try to apply the code to my real case (not the simplified test case) and we'll see – aleskva Jan 21 '16 at 11:10
  • Also, please try to remove capturing groups since we are not interested in them if we use `repl`: `pat = r'Ulist\d+\[\[(?>[^][]|](?!])|\[(?!\[)|(?2))*]]'`. – Wiktor Stribiżew Jan 21 '16 at 11:12
  • I handled complicated cases by running the modified script multiple times, so it is OK now. I have another similar list pattern, where I want to change `items` to `Uitems` in `Ulists`. I thought I could use the same solution for both, but it looks like I made a mistake when changing the code for the second case and I don't know what's wrong. Please see [my question about the second case as well](http://stackoverflow.com/questions/34923554/regex-to-match-special-list-items-ii). – aleskva Jan 21 '16 at 12:22
  • Isn't that "memory" bug caused by a typo in the argument of `regex.split()`? To me the regex inside does not make sense, unlike the one in `pat` (but I'm a beginner) – aleskva Jan 21 '16 at 12:48
  • The regex inside `repl` is a good and very efficient regex based on an unroll-the-loop technique. There are no issues with the types I believe. – Wiktor Stribiżew Jan 21 '16 at 13:00
  • OK, then I hope they'll fix it sometime. I also figured out the mistake in the second case, so I changed it and asked for help with that list inside Ulist issue – aleskva Jan 21 '16 at 13:16
  • Running the script multiple times didn't help in the end. I also tried adding a list inside a Ulist inside another Ulist and tested the code for this case. The memory error didn't appear, but the text wasn't processed, so the modified answer still isn't working. – aleskva Jan 21 '16 at 13:26
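
For inputs where plain lists and Ulists nest inside each other to arbitrary depth, one alternative that avoids recursive regexes is to scan the tokens once while tracking which list is currently open. The following is a minimal sketch using only the standard library, not the method from the answer. It assumes that tokens are separated by whitespace, that a list opener looks like `name[[`, that `]]` closes the innermost open list, and that an item should be renamed only when its innermost enclosing list is a Ulist. The function name `rename_ulist_items` is hypothetical.

```python
import re

OPENER = re.compile(r'^(\w+)\[\[$')   # e.g. "Ulist1[[" or "list3[["
ITEM = re.compile(r'^item(\d+)$')     # e.g. "item4"


def rename_ulist_items(text):
    """Prefix items with 'U' when their innermost enclosing list is a Ulist."""
    stack = []   # one bool per open list: True if that list is a Ulist
    out = []
    # Splitting on a captured whitespace group keeps the original spacing.
    for token in re.split(r'(\s+)', text):
        opener = OPENER.match(token)
        if opener:
            stack.append(opener.group(1).startswith('Ulist'))
        elif token == ']]':
            if stack:
                stack.pop()
        elif stack and stack[-1]:
            token = ITEM.sub(r'Uitem\1', token)
        out.append(token)
    return ''.join(out)


sample = ("| item3 | Ulist1[[ | item4 | Ulist2[[ | item6 ]] | item8 "
          "| list4[[ | item15 ]] | item18 ]] | item9 | list3[[ | item10 ]] | item13")
renamed = rename_ulist_items(sample)
```

Because the stack records only the kind of each open list, the nesting depth does not matter, and a plain list inside a Ulist is left untouched under the stated assumption.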