Questions tagged [html-content-extraction]

Techniques for predicting/detecting certain article text and extracting it from a particular document.

Also referred to as web scraping (web harvesting or web data extraction), this is a software technique for extracting information from websites. Such programs usually simulate human exploration of the World Wide Web either by implementing low-level Hypertext Transfer Protocol (HTTP) requests or by embedding a full web browser, such as Internet Explorer or Mozilla Firefox.
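
As a rough illustration of what content extraction can look like in practice, here is a minimal sketch that pulls the readable text out of an HTML string using only Python's standard-library `html.parser`. The class name `VisibleTextExtractor`, the helper `extract_text`, and the sample markup are hypothetical names chosen for this example; the sketch assumes the page has already been fetched and only skips `<script>` and `<style>` content, so it is not a full boilerplate-removal algorithm.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []     # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample document, used only for illustration.
sample = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Heading</h1><p>Hello, <b>world</b>!</p>"
    "<script>var x = 1;</script></body></html>"
)

if __name__ == "__main__":
    print(extract_text(sample))
```

An event-based parser is used here rather than regular expressions because it tolerates nested and unclosed tags; third-party libraries discussed in the questions below, such as BeautifulSoup, offer richer APIs for the same task.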

211 questions

318

votes

36 answers

Extracting text from HTML file using Python

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad. I'd like something more robust than using regular expressions that may…

asked Nov 30 '08 at 02:28

John D. Cook

29,517
10
67
94

235

votes

11 answers

Extract part of a regex match

I want a regular expression to extract the title from an HTML page. Currently I have this: title = re.search('<title>.*</title>', html, re.IGNORECASE).group() if title: title = title.replace('<title>', '').replace('</title>', '') Is there a…

python html regex html-content-extraction

asked Aug 25 '09 at 10:24

hoju

28,392
37
134
178

165

votes

10 answers

How to extract img src, title and alt from html using php?

I would like to create a page where all images which reside on my website are listed with title and alternative representation. I have already written a small program to find and load all HTML files, but now I am stuck at how to extract src, title and…

php html regex html-parsing html-content-extraction

asked Sep 26 '08 at 08:33

Sam

28,421
49
167
247

151

votes

11 answers

How to scrape only visible webpage text with BeautifulSoup?

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even a few tab names here and there. I have tried the…

python web-scraping text beautifulsoup html-content-extraction

asked Dec 20 '09 at 17:55

user233864

1,727
2
13
12

votes

3 answers

Using BeautifulSoup to find a HTML tag that contains certain text

I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11}

<h2>this is cool #12345678901</h2>

So, the previous would match by using: soup('h2',text=re.compile(r' #\S{11}')) And the results would be…

python regex beautifulsoup html-content-extraction

asked May 14 '09 at 21:46

sotangochips

2,700
6
28
38

votes

9 answers

parsing HTML on the iPhone

Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate. Does such a library exist, or am I better off just trying to use regular expressions?

iphone html parsing html-content-extraction

asked Jan 02 '09 at 00:37

Sophie Alpert

139,698
36
220
238

votes

15 answers

What is the best way to parse html in C#?

I'm looking for a library/method to parse an html file with more html specific features than generic xml parsing libraries.

c# .net html parsing html-content-extraction

asked Sep 11 '08 at 09:16

Luke

18,585
24
87
110

votes

12 answers

"Smart" way of parsing and using website data?

How does one intelligently parse data returned by search results on a page? For example, let's say that I would like to create a web service that searches for online books by parsing the search results of many book providers' websites. I could get…

web-services parsing html html-content-extraction

asked Aug 03 '09 at 17:04

bluebit

2,987
7
34
42

votes

11 answers

regular expression to extract text from HTML

I would like to extract all the text (displayed or not) from a general HTML page. I would like to remove any HTML tags, any JavaScript, and any CSS styles. Is there a regular expression (one or more) that will achieve that?

html regex html-content-extraction text-extraction

asked Oct 08 '08 at 01:43

Ron Harlev

16,227
24
89
132

votes

2 answers

Create Great Parser - Extract Relevant Text From HTML/Blogs

I'm trying to create a generalized HTML parser that works well on blog posts. I want to point my parser at the specific entry's URL and get back clean text of the post itself. My basic approach (from Python) has been to use a combination of…

html parsing text-parsing html-content-extraction

asked Jul 18 '09 at 07:27

nartz

votes

5 answers

How do you parse an HTML in vb.net

I would like to know if there is a simple way to parse HTML in vb.net. I know that HTML is not a strict subset of XML, but it would be nice if it could be treated that way. Is there anything out there that would let me parse HTML in an XML-like way…

.net html vb.net parsing html-content-extraction

asked Feb 05 '09 at 16:59

tooleb

votes

8 answers

C# - Best Approach to Parsing Webpage?

I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this? I've tried saving the string as an .xml…

c# html xml html-content-extraction

asked Nov 18 '08 at 21:46

MattSayar

2,078
6
23
28

votes

8 answers

Text Extraction from HTML Java

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file. I want to extract the information that is in between the paragraph tags, but I can only get one line of the paragraph. My code…

java html screen-scraping html-content-extraction text-extraction

asked Sep 06 '09 at 16:52

MajorMajor

votes

8 answers

What is the state of the art in HTML content extraction?

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of…

html html-content-extraction text-extraction

asked Dec 26 '09 at 01:22

Charles Stewart

11,661
4
46
85

votes

3 answers

How can I read and parse the contents of a webpage in R

I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. I am wondering how I can do it.

html r screen-scraping html-content-extraction

asked Dec 04 '09 at 04:18

Mark

10,754
20
60
81
