0

I have the following regular expression to extract song names from a certain website:

<h2 class="chart-row__song">(.*?)</h2>

It displays the results below :

Where &#039; is in the output below, is an apostrophe on the website the song name is extract from.

How would I go about changing my regular expression to remove those characters? &#039;

Python output

TIA

  • Oops, you forgot to post your code! StackOverflow is about helping people fix their code. It's not a free coding service. Any code is better than no code at all. Meta-code, even, will demonstrate how you're thinking a program should work, even if you don't know how to write it. – ghoti May 21 '16 at 13:04
  • @ghoti The line starting with

    is the code 'regular expression'...

    –  May 21 '16 at 13:05
  • 1
    Firstly, [don't parse HTML with regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Secondly, a regex is not suited to this task. Those things are called HTML entities, do a search on them. – Alex Hall May 21 '16 at 13:05
  • That doesn't look like python to me. A regular expression is used to MATCH content, not alter it (like removing strings from it). You've tagged your queston [tag:python], so please include your attempt to achieve this in python. – ghoti May 21 '16 at 13:06
  • @ghoti it's not really far to expect a beginner to understand what's going on here. They could do a simple string replace and it would work but be a poor solution. – Alex Hall May 21 '16 at 13:07
  • @AlexHall I agree; the standard processes for handling poor-quality questions don't really accommodate [XY problems](http://mywiki.wooledge.org/XyProblem) like this. (I.e. "Please help me *use foo* to achieve bar" rather than "Please help me *achieve bar*".) Nevertheless, instructions [are readily available](http://stackoverflow.com/help/how-to-ask), and I feel that we improve the site by encouraging those who use them. – ghoti May 21 '16 at 13:10
  • I posted this in the wrong section, sorry people –  May 21 '16 at 13:31

1 Answers1

1

As stated in the comments, you can't do that using a regex alone. You need to unescape the HTML entities present in the match separately.

import re
import html
regex = re.compile(r'<h2 class="chart-row__song">(.*?)</h2>')
result = [html.unescape(s) for s in regex.findall(mystring)]
Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561