Questions tagged [html-content-extraction]

Techniques for predicting/detecting certain article text and extracting it from a particular document.

HTML content extraction is also referred to as web scraping (web harvesting or web data extraction), a computer software technique for extracting information from websites. Usually, such programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP) or embedding a full-fledged web browser, such as Internet Explorer or Mozilla Firefox.
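As a rough illustration of the extraction side of this tag, here is a minimal sketch that pulls text out of an HTML string using only Python's standard-library `html.parser`. The class name `VisibleTextExtractor`, the helper `extract_text`, and the sample markup are assumptions made up for this example; it skips only `<script>` and `<style>` contents (so `<title>` text is still returned) and does not fetch pages over HTTP or try to isolate the main article.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    # Hypothetical choice for this sketch: only these two tags are skipped.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the collected text of an HTML string, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample document for illustration.
sample = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h2>this is cool</h2><p>Hello, <b>world</b>!</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample))
```

Using a parser rather than regular expressions keeps the sketch tolerant of nested tags; for messy real-world pages, a dedicated library is usually the more robust choice.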

211 questions
318
votes
36 answers

Extracting text from HTML file using Python

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad. I'd like something more robust than using regular expressions that may…
John D. Cook
  • 29,517
  • 10
  • 67
  • 94
235
votes
11 answers

Extract part of a regex match

I want a regular expression to extract the title from an HTML page. Currently I have this: title = re.search('<title>.*</title>', html, re.IGNORECASE).group() if title: title = title.replace('<title>', '').replace('</title>', '') Is there a…
hoju
  • 28,392
  • 37
  • 134
  • 178
165
votes
10 answers

How to extract img src, title and alt from html using php?

I would like to create a page where all images which reside on my website are listed with title and alternative representation. I already wrote a little program to find and load all HTML files, but now I am stuck at how to extract src, title and…
Sam
  • 28,421
  • 49
  • 167
  • 247
151
votes
11 answers

How to scrape only visible webpage text with BeautifulSoup?

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even a few tab names here and there. I have tried the…
user233864
  • 1,727
  • 2
  • 13
  • 12
73
votes
3 answers

Using BeautifulSoup to find a HTML tag that contains certain text

I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11} <h2>this is cool #12345678901</h2> So, the previous would match by using: soup('h2',text=re.compile(r' #\S{11}')) And the results would be…
sotangochips
  • 2,700
  • 6
  • 28
  • 38
68
votes
9 answers

parsing HTML on the iPhone

Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate. Does such a library exist, or am I better off just trying to use regular expressions?
Sophie Alpert
  • 139,698
  • 36
  • 220
  • 238
66
votes
15 answers

What is the best way to parse html in C#?

I'm looking for a library/method to parse an html file with more html specific features than generic xml parsing libraries.
Luke
  • 18,585
  • 24
  • 87
  • 110
33
votes
12 answers

"Smart" way of parsing and using website data?

How does one intelligently parse data returned by search results on a page? For example, let's say that I would like to create a web service that searches for online books by parsing the search results of many book providers' websites. I could get…
bluebit
  • 2,987
  • 7
  • 34
  • 42
22
votes
11 answers

regular expression to extract text from HTML

I would like to extract from a general HTML page, all the text (displayed or not). I would like to remove any HTML tags, any JavaScript, and any CSS styles. Is there a regular expression (one or more) that will achieve that?
Ron Harlev
  • 16,227
  • 24
  • 89
  • 132
22
votes
2 answers

Create Great Parser - Extract Relevant Text From HTML/Blogs

I'm trying to create a generalized HTML parser that works well on blog posts. I want to point my parser at the specific entry's URL and get back clean text of the post itself. My basic approach (from Python) has been to use a combination of…
nartz
19
votes
5 answers

How do you parse an HTML in vb.net

I would like to know if there is a simple way to parse HTML in vb.net. I know that HTML is not a strict subset of XML, but it would be nice if it could be treated that way. Is there anything out there that would let me parse HTML in an XML-like way…
tooleb
  • 612
  • 3
  • 6
  • 14
19
votes
8 answers

C# - Best Approach to Parsing Webpage?

I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this? I've tried saving the string as an .xml…
MattSayar
  • 2,078
  • 6
  • 23
  • 28
19
votes
8 answers

Text Extraction from HTML Java

I'm working on a program that downloads HTML pages and then selects some of the information and writes it to another file. I want to extract the information that is in between the paragraph tags, but I can only get one line of the paragraph. My code…
MajorMajor
18
votes
8 answers

What is the state of the art in HTML content extraction?

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of…
Charles Stewart
  • 11,661
  • 4
  • 46
  • 85
16
votes
3 answers

How can I read and parse the contents of a webpage in R

I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. I am wondering how I can do it.
Mark
  • 10,754
  • 20
  • 60
  • 81