1

I'm looking for everything between the '|' characters in data I scraped from a website. I've noticed that '|' separates all the fields I'm interested in.

["{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}"]

I want to print:

title=hello there!
subtitle=how are you
subsubtitle= I'm good, thanks

I think I should use look-behind and look-ahead, but my attempt doesn't work when the text sits between the '|' characters.

I guess it's something like:

(?<=title=)(.*)(?=subtitle=)

(I'm very new to RegEx, but eager to learn!)

Community
  • How does `subsubtitle= I'm good, thanks` qualify??? It is not in between `|`.. – Bhargav Rao Apr 27 '15 at 10:24
  • You've got 2 problems with your pattern. First, `.*` is a greedy match. Second, you didn't put the `|` anywhere in the pattern. Combine the two, and `title` will match everything up to the _last_ `subtitle=`, which happens to be the one in the middle of `subsubtitle=`. You could do `(.*?)`, or `(?=\|subtitle=)`. – abarnert Apr 27 '15 at 10:28
  • But, more simply, don't use all those look-behinds and look-aheads in the first place; what's wrong with the simpler `title=(.*?)\|subtitle=(.*?)\|subsubtitle=(.*?)`? – abarnert Apr 27 '15 at 10:31
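The greedy versus non-greedy difference described in the comments can be seen in a small sketch. It assumes a single-line variant of the question's data (newlines removed) so that `.` can span the whole string without `re.DOTALL`; the variable names are illustrative only.

```python
import re

# Assumption: single-line variant of the question's data (newlines removed).
text = "{somethingsomething|title=hello there!|subtitle=how are you|subsubtitle=I'm good, thanks}"

# Greedy .* runs as far as it can, so it stops before the LAST "subtitle=",
# which is the one inside "subsubtitle=".
greedy = re.search(r"(?<=title=)(.*)(?=subtitle=)", text).group(1)

# Lazy .*? plus an explicit "|" in the lookahead stops at the first delimiter.
lazy = re.search(r"(?<=title=)(.*?)(?=\|subtitle=)", text).group(1)

print(repr(greedy))  # 'hello there!|subtitle=how are you|sub'
print(repr(lazy))    # 'hello there!'
```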

7 Answers

2

If you really must use regular expressions for this, don't overcomplicate them with unnecessary lookbehind and lookahead. Those bits are part of the pattern you're trying to match, so just use them as such:

title=(.*?)[|]subtitle=(.*?)[|]subsubtitle=(.*?)}

Notice that I also included the | in your prefixes, because otherwise the | character would end up as part of each group. I also turned each of your greedy .* groups into a non-greedy .*?. That isn't actually necessary if you're matching all of the groups, but in your original pattern it's the reason the title ended up including everything up to the "sub" of subsubtitle, so that the subsubtitle was treated as the subtitle. Finally, I put the } on the end so you don't end up with the rest of the outer grouping as part of the subsubtitle.
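As a rough sketch of how this pattern might be run in Python: it assumes the `\n` sequences in the sample are real newline characters, in which case `.` needs `re.DOTALL` to get past them, and each group is stripped afterwards. The helper name `parse_fields` is made up for illustration.

```python
import re

# Sample string from the question; the \n escapes are real newlines here.
data = "{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}"

# re.DOTALL lets '.' also match the newline that precedes each '|'.
PATTERN = re.compile(r"title=(.*?)[|]subtitle=(.*?)[|]subsubtitle=(.*?)}", re.DOTALL)

def parse_fields(text):
    """Return (title, subtitle, subsubtitle) or None if the pattern does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Each captured group still carries its trailing newline, so strip it.
    return tuple(group.strip() for group in match.groups())

result = parse_fields(data)
if result:
    for name, value in zip(("title", "subtitle", "subsubtitle"), result):
        print("{0}={1}".format(name, value))
```

Without `re.DOTALL` the same pattern finds no match on this input, because `.` stops at each newline.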

abarnert
1

You can use the split() method:

In [5]: data = "{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}"[1:-1]
In [6]: data
Out[6]: "somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n"
In [7]: data = data.replace("\n", "")
In [8]: data
Out[8]: "somethingsomething|title=hello there!|subtitle=how are you|subsubtitle=I'm good, thanks"
In [9]: words = data.split("|")
In [10]: words
Out[10]: 
['somethingsomething',
 'title=hello there!',
 'subtitle=how are you',
 "subsubtitle=I'm good, thanks"]
In [11]: title = words[1].split("=", 1)[1]
In [12]: title
Out[12]: 'hello there!'
In [13]: subtitle = words[2].split("=", 1)[1]
In [14]: subtitle
Out[14]: 'how are you'
In [15]: subsubtitle = words[3].split("=", 1)[1]
In [16]: subsubtitle
Out[16]: "I'm good, thanks"
Michael Kazarian
1

Regex is only necessary when dealing with complex strings. A simple string like this can be handled using only string methods:

a = "[\"{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}\"]"
b = a.lstrip('["{')
c = b.rstrip('}"]')
c.split('|')
# ['somethingsomething',
# 'title=hello there!\n',
# 'subtitle=how are you\n',
# "subsubtitle=I'm good, thanks\n"]
ljk321
0

A possible solution:

import re

regex = re.compile(r'\["\{([^}]+)\}"\]')
match = regex.match('["{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I\'m good, thanks\n}"]')
match.groups()[0].split('|')

-> ['somethingsomething', 'title=hello there!\n', 'subtitle=how are you\n', "subsubtitle=I'm good, thanks\n"]

You might want to rstrip the strings afterwards.
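For instance, the trailing newlines could be removed with a list comprehension. This is a sketch that reuses the pattern above; the variable names are illustrative.

```python
import re

raw = '["{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I\'m good, thanks\n}"]'

match = re.match(r'\["\{([^}]+)\}"\]', raw)
# rstrip() removes the trailing newline left on each piece after the split.
parts = [part.rstrip() for part in match.group(1).split('|')] if match else []
print(parts)
```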

Klaus D.
0

I think you could do:

string = '["{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I\'m good, thanks\n}"]'
string = string[3:-3]
# crop the three first and last characters from the string
sentences = string.split('|')
title = sentences[1]
...

This will include the title= prefix (and the trailing newline) in the result.

zertap
0

If you want to solve this using regular expressions, then one way is as below.

import re

s = ["{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}"]

# Each search captures everything after the key up to the end of that line.
match = re.search(r'title=(.*)\n', s[0])
if match:
    print("title={0}".format(match.group(1)))

match = re.search(r'subtitle=(.*)\n', s[0])
if match:
    print("subtitle={0}".format(match.group(1)))

match = re.search(r'subsubtitle=(.*)\n', s[0])
if match:
    print("subsubtitle={0}".format(match.group(1)))
Mohd Ali
0

If you want a regex with lookahead and lookbehind, you can try the following:

In [1]: import re

In [2]: s = "{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}"

In [3]: m = re.findall(r"""(?<=\|)(?P<foo>.*?)(?:\=)(?P<bar>.*?(?=\n))""", s)

In [4]: for i, j in m:
   ...:     print("{} = {}".format(i, j))
   ...:     
title = hello there!
subtitle = how are you
subsubtitle = I'm good, thanks
NarūnasK