Regex for X or not Y

Question

I have a long string of text that's broken up by semi-colons, so I have a regex that captures [^\;]+. However, it breaks because the content contains HTML apostrophe entities (such as &#39;), which themselves end in a semi-colon.

How can I write a regex that will capture everything but the semi-colons unless the semi-colon is part of the HTML apostrophe?

I swear, if I had a nickel for every time... http://stackoverflow.com/q/1732348/576139 — Chris Eberle, Apr 18 '13 at 18:31
Can we just have a bot that posts that link whenever someone has the terms HTML and regex in a single question? I don't get why people feel the need to make things harder for themselves. There are so many tools to do this exact job, and instead they want to reinvent the wheel with something really not designed for the job. — Gareth Latty, Apr 18 '13 at 18:32
@thumbtackthief: You should entity-decode the HTML before splitting. — nhahtdh, Apr 18 '13 at 18:35
@Chris I've already read that question. It is irrelevant to what I am trying to do. — thumbtackthief, Apr 18 '13 at 18:36
@thumbtackthief: What version of BeautifulSoup? BS4 should convert the entities into unicode characters automatically: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup — Blender, Apr 18 '13 at 18:36
No chance of just getting the answer to my question, which is how to make a regex that does what I'm asking? — thumbtackthief, Apr 18 '13 at 18:37

score 4 · Accepted Answer · answered Apr 18 '13 at 18:32

(&\S+?;|[^;])+

Match HTML entities as if they were single characters.
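
For anyone who wants to try it, here is a minimal Python sketch of the pattern above. Python is an assumption based on the BeautifulSoup mention in the comments; the sample string and the name split_fields are made up for illustration. The group is written as non-capturing, (?:...), only so that re.findall returns whole fields. The alternation tries the entity branch first, so the semi-colon that closes an entity is consumed as part of the entity and never acts as a delimiter.

```python
import html
import re

# The answer's pattern, with a non-capturing group (?:...) so that
# re.findall returns each whole field instead of the last group match.
FIELD = re.compile(r"(?:&\S+?;|[^;])+")


def split_fields(text):
    """Split on semi-colons, treating entities like &#39; as single characters."""
    return FIELD.findall(text)


# Hypothetical sample input, not taken from the question.
sample = "It&#39;s here;second field;Tom &amp; Jerry"

fields = split_fields(sample)
print(fields)

# Optionally decode the entities once the fields are separated.
decoded = [html.unescape(f) for f in fields]
print(decoded)
```

Two caveats with this sketch: empty fields (two semi-colons in a row) are skipped rather than returned as empty strings, and any "&" followed by non-space characters and a semi-colon is treated as an entity, so text like "AT&T;next" would not be split at that semi-colon.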

John Kugelman
