
I need to pull out a single string containing the words from extracted fields:

[[cat]][[dog]][[mouse]][[apple]][[banana]][[pear]][[plum]][[pool]]

So from this I need: cat dog mouse apple banana pear plum pool.

I've been trying for 2 hours to make a regular expression for this.

The best I get is (?<=[[]\S)(.*)(?=]]) which gets me:

cat]][[dog]][[mouse]][[apple]][[banana]][[pear]][[plum]][[pool

Any ideas? Thanks!

midori
  • A simple search for characters would do. `/[a-z]+/g`. [Demo](https://regex101.com/r/cX0hA0/1) –  Feb 02 '16 at 22:42
  • Possible duplicate of [Difference between .\*? and .\* for regex](http://stackoverflow.com/questions/3075130/difference-between-and-for-regex) – HamZa Feb 02 '16 at 22:43
  • can the double brackets be nested? – timgeb Feb 02 '16 at 22:43
  • This really looks like an XY problem where you've created some badly formed data and now need to get at the information. Where is the data coming from? – the Tin Man Feb 02 '16 at 23:27

3 Answers


Here's a solution with re.finditer. Let your string be s. This assumes there can be anything in between [[ and ]]. Otherwise, the comment by @noob applies.

>>> [x.group(1) for x in re.finditer(r'\[\[(.*?)\]\]', s)]
['cat', 'dog', 'mouse', 'apple', 'banana', 'pear', 'plum', 'pool']

Alternatively, with lookarounds and re.findall:

>>> re.findall(r'(?<=\[\[).*?(?=\]\])', s)
['cat', 'dog', 'mouse', 'apple', 'banana', 'pear', 'plum', 'pool']
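To see why the pattern in the question overshoots, here is a minimal sketch that runs a greedy and a lazy quantifier over the same string, then joins the matches into the single string the question asks for. The variable names are only illustrative.

```python
import re

s = '[[cat]][[dog]][[mouse]][[apple]][[banana]][[pear]][[plum]][[pool]]'

# Greedy: .* runs on to the last ]] in the string, so there is only one match.
greedy = re.findall(r'\[\[(.*)\]\]', s)

# Lazy: .*? stops at the first ]] it can reach, giving one match per field.
lazy = re.findall(r'\[\[(.*?)\]\]', s)

# Join the fields into one space-separated string.
result = ' '.join(lazy)

print(len(greedy))
print(result)
```

The only difference between the two patterns is the `?` after `.*`.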

For large strings, the finditer version seemed to be slightly faster when I timed the alternatives.

In [5]: s=s*1000
In [6]: timeit [x.group(1) for x in re.finditer(r'\[\[(.*?)\]\]', s)]
100 loops, best of 3: 3.61 ms per loop
In [7]: timeit re.findall(r'(?<=\[\[).*?(?=\]\])', s)
100 loops, best of 3: 5.93 ms per loop
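Outside IPython, a comparable measurement could be sketched with the standard `timeit` module. The function names `with_finditer` and `with_findall` are made up for this example, and the numbers will vary by machine and Python version, so treat it as a way to repeat the comparison rather than as a benchmark result.

```python
import re
import timeit

s = '[[cat]][[dog]][[mouse]][[apple]][[banana]][[pear]][[plum]][[pool]]' * 1000


def with_finditer():
    # Capture group between the literal [[ and ]] delimiters.
    return [m.group(1) for m in re.finditer(r'\[\[(.*?)\]\]', s)]


def with_findall():
    # Lookarounds keep the delimiters out of the match itself.
    return re.findall(r'(?<=\[\[).*?(?=\]\])', s)


# Both approaches should agree before their speed is compared.
assert with_finditer() == with_findall()

for fn in (with_finditer, with_findall):
    seconds = timeit.timeit(fn, number=100)
    print(f'{fn.__name__}: {seconds / 100 * 1000:.2f} ms per call')
```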
timgeb

A simple re.split will work:

>>> s = '[[cat]][[dog]][[mouse]][[apple]][[banana]][[pear]][[plum]][[pool]]'
>>> import re
>>> print(re.split(r'[\[\]]{2,4}', s)[1:-1])
['cat', 'dog', 'mouse', 'apple', 'banana', 'pear', 'plum', 'pool']
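The `[1:-1]` slice is there because the string starts and ends with a delimiter, so the split produces an empty string at each end. A small sketch on a shortened input shows this; filtering out empty strings is an alternative I am assuming here, not part of the answer above.

```python
import re

s = '[[cat]][[dog]][[mouse]]'

# Splitting on runs of 2 to 4 brackets leaves '' before the first
# delimiter and after the last one.
parts = re.split(r'[\[\]]{2,4}', s)
print(parts)  # ['', 'cat', 'dog', 'mouse', '']

# Dropping empty strings avoids relying on where the delimiters sit.
words = [p for p in parts if p]
print(' '.join(words))  # cat dog mouse
```

Filtering would also cope with input that does not begin or end with brackets, where a fixed `[1:-1]` slice would drop real words.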
midori

Do you have to do it with a regular expression?

extract = "[[cat]][[dog]][[mouse]][[apple]][[banana]][[pear]][[plum]][[pool]]"
word_list = [word for word in extract.replace('[', '').split(']') if word != '']
print(word_list)

Output:

['cat', 'dog', 'mouse', 'apple', 'banana', 'pear', 'plum', 'pool']

Got it with regular expressions now. Simply find non-empty strings of stuff without brackets.

import re

target = "[[cat]][[dog]][[mouse]][[apple]][[banana]][[pear]][[plum]][[pool]]"
word_list = ' '.join(re.findall(r"[^\[\]]+", target))
print(word_list)

Edited to return the single string, rather than a list of strings.
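One caveat, sketched on a made-up input: matching "anything that is not a bracket" and matching "anything between [[ and ]]" give the same result on the string from the question, but they differ if a field can itself contain a single bracket. The `target` value below is hypothetical and only there to show the difference.

```python
import re

# Hypothetical input: the first field contains a lone ']' in its text.
target = '[[a]b]][[c]]'

# Character-class approach: every bracket ends a word.
by_char_class = re.findall(r'[^\[\]]+', target)

# Delimiter approach: only the double brackets end a field.
by_delimiters = re.findall(r'\[\[(.*?)\]\]', target)

print(by_char_class)  # ['a', 'b', 'c']
print(by_delimiters)  # ['a]b', 'c']
```

If the fields never contain brackets, as in the question, the simpler character-class pattern is enough.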

Prune
  • No, I don't have to. I had been solving a few of my text-cleaning issues with them, so I just kept trying them. This did work though. Thanks! –  Feb 02 '16 at 22:46