How can I use RegEx to extract info from an HTML document

Question

I am trying to extract some information from an HTML document using RegEx (it must be regex, not an HTML parser). The document is called "website1.html" and contains the data below:

<div class="category"><div class="comedy">Category1</div></div>
   <p class="desc">Title1</p>
   <p class="date">Date1/p>

<div class="category"><div class="comedy">Category2</div></div>
   <p class="desc">Title2</p>
   <p class="date">Date2/p>

How can I first open the HTML document so that Python can read it, and then extract the text from the class="comedy", class="desc", and class="date" elements using regex findall expressions?

I want them in separate lists, so that I end up with ["Title1", "Title2"] in one list, ["Category1", "Category2"] in another, and so on.

I have the overall process mapped out in my head, but I don't know the specific characters/functions to use.

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — James Steele, May 24 '19 at 13:52
Possible duplicate of [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg) — Bharel, May 24 '19 at 13:53
instead of Regex use a [`proper web scraper`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start) — Kunal Mukherjee, May 24 '19 at 13:53

Ahmed Yousif · Accepted Answer · 2019-05-24T14:18:16.230

1

You can do this with regular expressions, as in the following example:

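import re

# Placeholder path; point this at your copy of website1.html
filename = 'path\\website1.html'

# Read the whole file into one string; the with-block closes the file afterwards
with open(filename, "r") as f:
    t = f.read()

# Each (.*?) group lazily captures the text between the opening and closing tag
categories = re.findall(r'<div class="comedy">(.*?)</div>', t)
descs = re.findall(r'<p class="desc">(.*?)</p>', t)
# The sample HTML closes the date paragraph with "/p>" (no "<"), so match that
dates = re.findall(r'<p class="date">(.*?)/p>', t)

# Print the extracted lists
print(categories)
print(descs)
print(dates)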

The result:

['Category1', 'Category2']
['Title1', 'Title2']
['Date1', 'Date2']

Note that your HTML is not well formed (<p class="date">Date2/p> is missing the "<" of its closing tag). I wrote the date pattern to match your example as given.
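
If the date paragraph might be closed either with the malformed /p> from the question or with a proper </p>, a slightly more tolerant pattern could look like the sketch below. The `extract` helper and the inline `sample` string are my own assumptions for illustration (the second date is deliberately well formed to show both cases). It still assumes each element sits on a single line with exactly these class attributes.

```python
import re

# Hypothetical sample mirroring the question's markup: the first date
# uses the malformed "/p>" close, the second a well-formed "</p>".
sample = '''<div class="category"><div class="comedy">Category1</div></div>
   <p class="desc">Title1</p>
   <p class="date">Date1/p>

<div class="category"><div class="comedy">Category2</div></div>
   <p class="desc">Title2</p>
   <p class="date">Date2</p>
'''

def extract(html):
    """Return (categories, descs, dates) as three separate lists."""
    categories = re.findall(r'<div class="comedy">(.*?)</div>', html)
    descs = re.findall(r'<p class="desc">(.*?)</p>', html)
    # "<?" makes the "<" of the closing tag optional, so both forms match
    dates = re.findall(r'<p class="date">(.*?)<?/p>', html)
    return categories, descs, dates

categories, descs, dates = extract(sample)
print(categories)
print(descs)
print(dates)
```

To run it against the real file, pass the string read from website1.html to `extract` instead of `sample`.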

edited May 24 '19 at 14:18

answered May 24 '19 at 14:08

Ahmed Yousif


How can I make it so it extracts information straight from HTML file instead of pasting the file into my python script? (What do I put instead of "t") – Jason7261 May 24 '19 at 14:11
1

@Jason7261 I have updated the answer; please check it. – Ahmed Yousif May 24 '19 at 14:19
