Looking for sentence in between "|"s in Regex

Question

I'm looking for anything in between the '|' characters in data I scraped from a website. I've noticed that '|' separates all the stuff I'm interested in.

["{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}"]

I want to print:

title=hello there!
subtitle=how are you
subsubtitle= I'm good, thanks

I think I should use look-behind and look-ahead, but it doesn't work when the text is in between the '|' characters.

I guess it's something like:

(?<=title=)(.*)(?=subtitle=)

(I'm very new to RegEx, but eager to learn!)

How does `subsubtitle= I'm good, thanks` qualify??? It is not in between `|`.. — Bhargav Rao, Apr 27 '15 at 10:24
You've got 2 problems with your pattern. First, `.*` is a greedy match. Second, you didn't put the `|` anywhere in the pattern. Combine the two, and `title` will match everything up to the _last_ `subtitle=`, which happens to be the one in the middle of `subsubtitle=`. You could do `(.*?)`, or `(?=\|subtitle=)`. — abarnert, Apr 27 '15 at 10:28
But, more simply, don't use all those look-behinds and look-aheads in the first place; what's wrong with the simpler `title=(.*?)\|subtitle=(.*?)\|subsubtitle=(.*?)`? — abarnert, Apr 27 '15 at 10:31
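
To see the greedy-versus-lazy difference described in these comments, here is a minimal sketch. It assumes the record sits on a single line (newlines already removed), because `.` does not match a newline by default; the variable names are only illustrative.

```python
import re

# Assumption: newlines were removed first, so the whole record is on one line.
data = "{somethingsomething|title=hello there!|subtitle=how are you|subsubtitle=I'm good, thanks}"

# Greedy .* runs to the LAST "subtitle=", the one inside "subsubtitle=".
greedy = re.search(r"(?<=title=)(.*)(?=subtitle=)", data).group(1)

# Lazy .*? plus the "|" in the look-ahead stops at the first field boundary.
lazy = re.search(r"(?<=title=)(.*?)(?=\|subtitle=)", data).group(1)

print(greedy)  # hello there!|subtitle=how are you|sub
print(lazy)    # hello there!
```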

score 2 · Accepted Answer · answered Apr 27 '15 at 10:38

If you really must use regular expressions for this, don't overcomplicate them with unnecessary lookbehind and lookahead. Those bits are part of the pattern you're trying to match, so just use them as such:

title=(.*?)[|]subtitle=(.*?)[|]subsubtitle=(.*?)}


Notice that I also included the | in your prefixes, because otherwise the | character is going to end up as part of each group. And I turned each of your greedy .* groups into a non-greedy .*?. That isn't actually necessary if you're matching all of the groups—but in your original example, it's the reason the title ended up including everything up to sub and the subsubtitle ended up as the subtitle. And finally, I put the } on the end so you don't end up with the whole outer grouping as part of the subsubtitle.
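
A minimal sketch of running that pattern from Python. One assumption goes beyond the answer: if the `\n` sequences in the scraped string are real newlines, `.` will not match them by default, so `re.DOTALL` is added here and the captured values are stripped afterwards. The variable names are illustrative.

```python
import re

raw = "{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}"

# re.DOTALL is an addition to the pattern above: each value ends with a
# newline before the next "|", and "." must be allowed to consume it.
pattern = re.compile(r"title=(.*?)[|]subtitle=(.*?)[|]subsubtitle=(.*?)}", re.DOTALL)

match = pattern.search(raw)
if match:
    # Drop the trailing newline captured with each group.
    title, subtitle, subsubtitle = (g.strip() for g in match.groups())
    print("title=" + title)
    print("subtitle=" + subtitle)
    print("subsubtitle=" + subsubtitle)
```

If the newlines are removed first (for example with `replace("\n", "")`), the flag and the `strip()` calls should not be needed.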

score 1 · Answer 2 · answered Apr 27 '15 at 10:26

You can use the split() method:

In [5]: data = "{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}"[1:-1]
In [6]: data
Out[6]: "somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n"
In [7]: data = data.replace("\n", "")
In [8]: data
Out[8]: "somethingsomething|title=hello there!|subtitle=how are you|subsubtitle=I'm good, thanks"
In [9]: words = data.split("|")
In [10]: words
Out[10]: 
['somethingsomething',
 'title=hello there!',
 'subtitle=how are you',
 "subsubtitle=I'm good, thanks"]
In [11]: title = words[1].split("=")[1]
In [12]: title
Out[12]: 'hello there!'
In [13]: subtitle = words[2].split("=")[1]
In [14]: subtitle
Out[14]: 'how are you'
In [15]: subsubtitle = words[3].split("=")[1]
In [16]: subsubtitle
Out[16]: "I'm good, thanks"
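
The same steps can be folded into one helper that returns a dict. This is only a sketch: `parse_fields` is a hypothetical name, and it uses `str.partition` instead of `split("=")` so that a value containing `=` would stay intact.

```python
def parse_fields(record):
    """Turn a '{head|key=value|key=value}' record into a dict of fields."""
    body = record.strip().strip("{}")  # drop the surrounding braces
    fields = {}
    for part in body.split("|"):
        key, sep, value = part.partition("=")  # split on the first "=" only
        if sep:  # skip pieces without "=", such as the leading 'somethingsomething'
            fields[key.strip()] = value.strip()
    return fields


record = "{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}"
fields = parse_fields(record)
for key, value in fields.items():
    print("{0}={1}".format(key, value))
```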

score 1 · Answer 3 · answered Apr 27 '15 at 10:26

Regex is only necessary when dealing with complex strings. A simple string like this can be handled using only string functions:

a = "[\"{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}\"]"
b = a.lstrip('["{')
c = b.rstrip('}"]')
c.split('|')
# ['somethingsomething',
# 'title=hello there!\n',
# 'subtitle=how are you\n',
# "subsubtitle=I'm good, thanks\n"]

score 0 · Answer 4 · answered Apr 27 '15 at 10:26

A possible solution:

import re

regex = re.compile(r'\["\{([^}]+)\}"\]')
match = regex.match('["{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I\'m good, thanks\n}"]')
match.groups()[0].split('|')

-> ['somethingsomething', 'title=hello there!\n', 'subtitle=how are you\n', "subsubtitle=I'm good, thanks\n"]

You might want to rstrip the strings afterwards.
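
A short sketch of that last step, reusing the same pattern and stripping the trailing newline from each piece. The names `raw` and `parts` are illustrative.

```python
import re

raw = '["{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I\'m good, thanks\n}"]'

match = re.match(r'\["\{([^}]+)\}"\]', raw)
# rstrip() with no argument removes trailing whitespace, including "\n".
parts = [p.rstrip() for p in match.group(1).split('|')] if match else []

print(parts)
# ['somethingsomething', 'title=hello there!', 'subtitle=how are you', "subsubtitle=I'm good, thanks"]
```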

zertap · Answer 5 · 2015-04-27T10:34:32.443

0

I think you could do:

string = '["{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I\'m good, thanks\n}"]'
string = string[3:-3]
# crop the three first and last characters from the string
sentences = string.split('|')
title = sentences[1]
...

This will include the title= in the result.

edited Apr 27 '15 at 10:34

answered Apr 27 '15 at 10:29


`sentences[0]` is `somethingsomething`, not the title. And just using `sentences[1]` doesn't help; you still need to split the `=`. – abarnert Apr 27 '15 at 10:32
oh, my bad, I thought he wanted to print with the 'labels' – zertap Apr 27 '15 at 10:35

score 0 · Answer 6 · answered Apr 27 '15 at 10:57

If you want to solve this using regular expressions, one way is shown below.

import re

s = ["{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}"]

# Each search captures everything after the label up to the end of that line.
match = re.search(r'title=(.*)\n', s[0])
if match:
    print("title={0}".format(match.group(1)))

match = re.search(r'subtitle=(.*)\n', s[0])
if match:
    print("subtitle={0}".format(match.group(1)))

match = re.search(r'subsubtitle=(.*)\n', s[0])
if match:
    print("subsubtitle={0}".format(match.group(1)))

NarūnasK · Answer 7 · 2015-04-27T11:23:14.723

If you want a regex with lookahead and lookbehind, you can try the following:

In [1]: import re

In [2]: s = "{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}"

In [3]: m = re.findall(r"""(?<=\|)(?P<foo>.*?)(?:\=)(?P<bar>.*?(?=\n))""", s)

In [4]: for i,j in m:
   ...:     print("{} = {}".format(i, j))
   ...:     
title = hello there!
subtitle = how are you
subsubtitle = I'm good, thanks
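
For comparison, a sketch of a single-pass variant without lookarounds that collects the pairs into a dict. It assumes keys contain neither `=` nor `|` and that each value ends at a newline or at the next `|`; the pattern and the name `pairs` are illustrative, not part of the answer above.

```python
import re

s = "{somethingsomething|title=hello there!\n|subtitle=how are you\n|subsubtitle=I'm good, thanks\n}"

# "|" starts a field, the key runs to "=", the value runs to a newline or "|".
pairs = dict(re.findall(r"\|([^=|]+)=([^\n|]*)", s))

for key, value in pairs.items():
    print("{} = {}".format(key, value))
```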
