I have a long string of text delimited by semi-colons, so I use a regex that captures [^\;]+. However, it breaks because the content contains HTML-encoded apostrophes (entities such as &#39;), which themselves end in a semi-colon.

How can I write a regex that will capture everything but the semi-colons unless the semi-colon is part of the HTML apostrophe?

thumbtackthief
  • Why aren't you parsing the HTML? – Blender Apr 18 '13 at 18:31
  • I swear, if I had a nickel for every time... http://stackoverflow.com/q/1732348/576139 – Chris Eberle Apr 18 '13 at 18:31
  • You can't. This isn't what regular expressions are for. – Ant P Apr 18 '13 at 18:31
  • Can we just have a bot that posts that link whenever someone has the terms HTML and regex in a single question? I don't get why people feel the need to make things harder for themselves. There are so many tools to do this exact job, and instead they want to reinvent the wheel with something really not designed for the job. – Gareth Latty Apr 18 '13 at 18:32
  • @thumbtackthief: You should entity-decode the HTML before splitting. – nhahtdh Apr 18 '13 at 18:35
  • @Chris I've already read that question. It is irrelevant to what I am trying to do. – thumbtackthief Apr 18 '13 at 18:36
  • @thumbtackthief: What version of BeautifulSoup? BS4 should convert the entities into unicode characters automatically: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup – Blender Apr 18 '13 at 18:36
  • No chance of just getting the answer to my question, which is how to make a regex that does what I'm asking? – thumbtackthief Apr 18 '13 at 18:37
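
The entity-decoding approach suggested in the comments can be sketched as follows. This is a minimal illustration, assuming Python 3 and a made-up sample string (`raw` is hypothetical, not the asker's data). One caveat: if the content contained an entity that decodes to a semi-colon, decoding first would introduce an extra delimiter.

```python
import html

# Hypothetical sample input: semi-colon-delimited fields with encoded apostrophes.
raw = "Bob&#39;s book;Alice&#39;s pen;plain"

# Decode the entities first, so the only semi-colons left are delimiters.
decoded = html.unescape(raw)

# Now a plain split is enough.
fields = decoded.split(";")
print(fields)  # ["Bob's book", "Alice's pen", 'plain']
```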

1 Answer

(&\S+?;|[^;])+

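This matches HTML entities as if they were single characters: the first alternative, &\S+?;, consumes a whole entity including its trailing semi-colon, and [^;] matches any other character that is not a semi-colon.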
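
As a rough sketch of how this pattern behaves in Python (the sample string and the names `FIELD`, `text`, and `parts` are made up for illustration), the group is written as non-capturing here. With a capturing group, `re.findall` would return only the last repetition of the group rather than the whole field.

```python
import re

# Either a whole entity (& ... ;) or any single character that is not a semi-colon.
# (?: ... ) keeps the group non-capturing so findall returns full matches.
FIELD = re.compile(r"(?:&\S+?;|[^;])+")

# Hypothetical sample input, not the asker's data.
text = "Bob&#39;s book;Alice&#39;s pen;plain"

parts = FIELD.findall(text)
print(parts)  # ['Bob&#39;s book', 'Alice&#39;s pen', 'plain']
```

Note that the pattern only assumes an entity looks like `&`, some non-whitespace characters, then `;`. A bare ampersand followed by text and a delimiter, as in `A&B;C`, would be read as an entity, so the two fields would come back as one.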

John Kugelman