I am working on Python code that extracts specific elements from websites and then displays them in a GUI built with the tkinter module. Extracting specific elements from a webpage requires regex, which I am new to. Although I can obtain various elements, I still find it difficult to extract certain ones. One such example is presented below.

<div class="updated published time-details"><a class="url" 
    href="https://thetriffid.com.au/gig/chocolate-starfish-one-last-kick/" 
    title="CHOCOLATE STARFISH (AUS) &#8220;ONE LAST KICK&#8221;" 
    rel="bookmark"><span class="tribe-event-date-start">Sat Aug 3 @ 8:00 
    pm</span>
    </a>
</div>

This is part of the HTML from which I need only the title, i.e. "Chocolate Starfish (AUS) & One Last Kick". I am using the findall method, and we are not allowed to use external libraries such as Beautiful Soup, so we have to work with findall, finditer, MULTILINE and DOTALL.

How do I get the desired outcome?

Emma
Joe Julen
  • `print(re.findall(r'(?<=title=")[^"]+', html))` should be enough, if you're not allowed to use `Beautiful Soup` or other html parsers – Pushpesh Kumar Rajwanshi May 09 '19 at 17:55
  • [You cannot parse HTML with regular expressions](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – mustaccio May 09 '19 at 17:59
  • Thanks for the help. Can you help me with one last piece? `<span class="tribe-event-date-start">Sat Aug 3 @ 8:00 pm</span>` I want to extract the date from this. – Joe Julen May 09 '19 at 18:02
  • @JoeJulen: Use `print(re.findall(r'(.+?)', html))` for the date, but it would really help to use a parser, since you are extracting several pieces of information and regex will be costly for that. Do you mind explaining what stops you from using parsers? – Pushpesh Kumar Rajwanshi May 09 '19 at 18:05
  • This is part of a project and our leader has restricted us from using other parsers. The code snippets you have provided work perfectly, but I am still trying to understand how they work :-) – Joe Julen May 09 '19 at 18:12
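
For readers following the comments, here is a minimal sketch of both extractions using only the standard `re` module. The `html` string is the markup from the question. The span pattern is an assumption based on the `tribe-event-date-start` class shown there, not necessarily the exact pattern from the comment, and the whitespace clean-up with `split`/`join` is an addition for illustration.

```python
import re

# Sample markup copied from the question.
html = '''<div class="updated published time-details"><a class="url"
    href="https://thetriffid.com.au/gig/chocolate-starfish-one-last-kick/"
    title="CHOCOLATE STARFISH (AUS) &#8220;ONE LAST KICK&#8221;"
    rel="bookmark"><span class="tribe-event-date-start">Sat Aug 3 @ 8:00
    pm</span>
    </a>
</div>'''

# Lookbehind: start matching right after title=" and stop before the next quote.
titles = re.findall(r'(?<=title=")[^"]+', html)

# Assumed pattern: capture everything inside the date span.
# DOTALL lets "." cross the line break inside the span text.
raw_dates = re.findall(
    r'<span class="tribe-event-date-start">(.+?)</span>', html, re.DOTALL
)

# Collapse runs of whitespace (including the newline) into single spaces.
dates = [" ".join(d.split()) for d in raw_dates]

print(titles)
print(dates)
```

The lookbehind `(?<=title=")` only asserts that the match is preceded by `title="`; it is not part of the returned text, which is why `findall` gives back just the attribute value.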

2 Answers

Using an HTML-aware solution like BeautifulSoup would handle more cases, but if you're sure the HTML will always conform to your example, you can use a rough regex match like:

import re
re.findall(r'<a.*? title="(.*?)"', html, re.DOTALL)
# ['CHOCOLATE STARFISH (AUS) &#8220;ONE LAST KICK&#8221;']
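
A self-contained sketch of the same idea, run against the markup from the question. The `unescape` step is not part of the answer above; it is an assumption about how one might turn the `&#8220;` and `&#8221;` entities into real quotation marks while staying within the standard library. The variable name `page` is only illustrative.

```python
import re
from html import unescape

# Sample markup copied from the question.
page = '''<div class="updated published time-details"><a class="url"
    href="https://thetriffid.com.au/gig/chocolate-starfish-one-last-kick/"
    title="CHOCOLATE STARFISH (AUS) &#8220;ONE LAST KICK&#8221;"
    rel="bookmark"><span class="tribe-event-date-start">Sat Aug 3 @ 8:00
    pm</span>
    </a>
</div>'''

# DOTALL is needed because the title attribute sits on a later line
# than the opening "<a".
matches = re.findall(r'<a.*? title="(.*?)"', page, re.DOTALL)

# Decode numeric character references such as &#8220; into characters.
titles = [unescape(m) for m in matches]

print(matches)
print(titles)
```

Because `.*?` is lazy, the match stops at the first ` title="` after `<a`, and the captured group ends at the next double quote.
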
jspcal

This regex finds 'a' tags that have a 'title' attribute; the attribute's value is captured in group 2.

As a raw string literal

r"(?si)<a(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?\stitle\s*=\s*(['\"])(.*?)\1)(?:\".*?\"|'.*?'|[^>]*?)+>"

Readable version

 (?si)

 <a
 (?=
      (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
      \s title \s* = \s* 
      ( ['"] )                      # (1)
      ( .*? )                       # (2)
      \1 
 )
 (?: " .*? " | ' .*? ' | [^>]*? )+
 >
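
In Python, the readable layout can be used almost as-is by compiling with `re.VERBOSE`, which ignores unescaped whitespace and `#` comments in the pattern. The sketch below assumes the sample markup from the question; the name `A_TITLE` and the extra comments are illustrative additions.

```python
import re

A_TITLE = re.compile(r"""
    <a
    (?=                                         # look ahead inside the tag
        (?: [^>"'] | " [^"]* " | ' [^']* ' )*?  # skip plain chars and quoted values
        \s title \s* = \s*
        ( ['"] )                                # (1) opening quote
        ( .*? )                                 # (2) title value
        \1                                      # same quote closes the value
    )
    (?: " .*? " | ' .*? ' | [^>]*? )+           # consume the rest of the tag
    >
""", re.VERBOSE | re.DOTALL | re.IGNORECASE)

# Sample markup copied from the question.
html = '''<div class="updated published time-details"><a class="url"
    href="https://thetriffid.com.au/gig/chocolate-starfish-one-last-kick/"
    title="CHOCOLATE STARFISH (AUS) &#8220;ONE LAST KICK&#8221;"
    rel="bookmark"><span class="tribe-event-date-start">Sat Aug 3 @ 8:00
    pm</span>
    </a>
</div>'''

# finditer gives match objects, so group 2 can be picked out directly.
titles = [m.group(2) for m in A_TITLE.finditer(html)]
print(titles)
```

With two capture groups, `findall` would return `(quote, title)` tuples, so `finditer` with `group(2)` is the more direct way to get only the title.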

Benchmark using a large web page (cnn.com) and 300 iterations

Regex1:   (?si)<a(?=(?:[^>"']|"[^"]*"|'[^']*')*?\stitle\s*=\s*(['"])(.*?)\1)(?:".*?"|'.*?'|[^>]*?)+>
Options:  < none >
Completed iterations:   300  /  300     ( x 1 )
Matches found per iteration:   285
Elapsed Time:    3.26 s,   3262.08 ms,   3262081 µs
Matches per sec:   26,210
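
The benchmark above does not state which tool or regex engine produced it. To get a rough figure for Python's `re` on your own input, a `timeit` sketch along these lines could be used. The repeated `sample_tag` is a made-up stand-in for a real page, and the count of 300 iterations simply mirrors the run above; timings will vary with machine and input.

```python
import re
import timeit

# Pattern taken from the "raw string" form above.
pattern = re.compile(
    r"(?si)<a(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?\stitle\s*=\s*(['\"])(.*?)\1)(?:\".*?\"|'.*?'|[^>]*?)+>"
)

# Hypothetical test input: one small anchor tag repeated 100 times.
sample_tag = '<a class="url" href="#" title="Some Title" rel="bookmark">text</a>\n'
page = sample_tag * 100

# With two capture groups, findall returns (quote, title) tuples.
matches = pattern.findall(page)

# Time 300 full scans of the input.
elapsed = timeit.timeit(lambda: pattern.findall(page), number=300)

print(f"{len(matches)} matches per iteration")
print(f"{elapsed:.3f} s for 300 iterations")
```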