Questions tagged [html-parsing]

HTML parsing is the process of consuming a serialization of an HTML document and producing a representation that you can work with programmatically, for example to extract data from it. The HTML specification defines a standard algorithm for parsing HTML, which is implemented in all major browsers.

HTML parsing typically involves converting an HTML document to a tree-based Document Object Model (DOM).

The standard parsing algorithm is specified at https://html.spec.whatwg.org/multipage/parsing.html#parsing.
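As a rough illustration of what "producing a tree you can work with programmatically" means, here is a minimal sketch using Python's standard-library `html.parser` module. It is not an implementation of the WHATWG algorithm linked above: `html.parser` is a lenient tokenizer, and the `Node`, `TreeBuilder`, `parse_html`, and `find_all` names are hypothetical helpers made up for this example. The error recovery shown (ignoring stray end tags, not nesting void elements) is a simplification of what browsers do.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not pushed on the open-element stack.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class Node:
    """A hypothetical, minimal element node: tag name, attributes, children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # child Node objects and plain text strings


class TreeBuilder(HTMLParser):
    """Builds a simple tree from the events emitted by html.parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(text)


def parse_html(markup):
    """Parse a string of HTML and return the root node of the tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Return every descendant element of `node` with the given tag name."""
    found = []
    for child in node.children:
        if isinstance(child, Node):
            if child.tag == tag:
                found.append(child)
            found.extend(find_all(child, tag))
    return found


doc = parse_html('<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a><br></li></ul>')
links = [(a.attrs["href"], a.children[0]) for a in find_all(doc, "a")]
print(links)
```

For real-world documents, a parser that follows the specification's error-handling rules is usually the safer choice; this sketch only shows the general shape of turning markup into a tree and querying it.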

How do you parse and process HTML/XML in PHP?

How can one parse HTML/XML and extract information from it?

php html xml xml-parsing html-parsing

asked Aug 26 '10 at 17:17

RobertPitt

56,863
21
114
161

402

votes

15 answers

Parse an HTML string with JS

I want to parse a string which contains HTML text. I want to do it in JavaScript. I tried the Pure JavaScript HTML Parser library but it seems that it parses the HTML of my current page, not from a string. Because when I try the code below, it…

javascript html dom html-parsing

asked May 14 '12 at 14:11

stage

4,225
4
15
8

309

votes

4 answers

How to strip HTML tags from string in JavaScript?

How can I strip the HTML from a string in JavaScript?

javascript html-parsing

asked Feb 15 '11 at 09:56

f.ardelian

6,716
8
36
53

254

votes

7 answers

Parsing HTML using Python

I'm looking for an HTML Parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects. If I have a document of the form: Heading

…

python xml-parsing html-parsing

asked Jul 29 '12 at 12:00

ffledgling

11,502
8
47
69

236

votes

18 answers

Using regular expressions to parse HTML: why not?

It seems like every question on Stack Overflow where the asker is using regex to grab some information from HTML will inevitably have an "answer" that says not to use regex to parse HTML. Why not? I'm aware that there are quote-unquote "real" HTML…

regex html-parsing

asked Feb 26 '09 at 14:24

ntownsend

7,462
9
38
35

208

votes

3 answers

How can I efficiently parse HTML with Java?

I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation. Now, I want to separate both the tasks. I want to use a light HTML parser because it takes much time in…

java html parsing html-parsing web-scraping

asked Jan 30 '10 at 16:52

Amit

33,847
91
226
299

201

votes

8 answers

What to do when a regular expression pattern doesn't match anywhere in a string?

I am trying to match fields of type hidden using this pattern: // This is some sample form data:

regex html-parsing

asked Nov 20 '10 at 05:33

Salman

2,119
3
15
14

199

votes

23 answers

Regex select all text between tags

What is the best way to select all the text between 2 tags - ex: the text between all the ' ' tags on the page.

html regex html-parsing

asked Aug 23 '11 at 20:42

basheps

10,034
11
36
45

165

votes

10 answers

How to extract img src, title and alt from html using php?

I would like to create a page where all images which reside on my website are listed with title and alternative representation. I already wrote a little program to find and load all HTML files, but now I am stuck at how to extract src, title and…

php html regex html-parsing html-content-extraction

asked Sep 26 '08 at 08:33

Sam

28,421
49
167
247

139

votes

0 answers

Robust and Mature HTML Parser for PHP

Are there any robust and mature HTML parsers available for PHP? A quick skimming of PEAR didn't turn anything up (lots of classes for generating HTML, not so much for consuming), and Google taught me a lot of people have started and then abandoned a…

php html html-parsing

asked Nov 15 '08 at 19:09

Alana Storm

164,128
91
395
599

118

votes

8 answers

How to extract string following a pattern with grep, regex or perl

I have a file that looks something like this:

regex perl sed html-parsing text-extraction

asked Feb 22 '11 at 16:34

wrangler

1,995
2
19
22

109

votes

6 answers

How do I parse an HTML page with Node.js

I need to parse (server side) large numbers of HTML pages. We all agree that regexp is not the way to go here. It seems to me that JavaScript is the native way of parsing an HTML page, but that assumption relies on the server side code having all the…

node.js html-parsing server-side

asked Sep 10 '11 at 16:18

Itay Moav -Malimovka

52,579
61
190
278

votes

5 answers

How do HTML parsers work if they're not using regexp?

I see questions every day asking how to parse or extract something from some HTML string and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted). This is rather confusing…

html regex parsing html-parsing

asked Mar 08 '10 at 10:30

Andy E

338,112
86
474
445

votes

1 answer

How to get HTML from a beautiful soup object

I have the following bs4 object listing: >>> listing

.... >>> type(listing) I want to extract the raw html as a string. I've tried: >>> a = listing.contents >>> type(a) So…

python html beautifulsoup html-parsing

asked Sep 08 '14 at 17:13

user1592380

34,265
92
284
515

votes

8 answers

How to normalize HTML in JavaScript or jQuery?

Tags can have multiple attributes. The order in which attributes appear in the code does not matter. For example: How can I "normalize" the HTML in JavaScript, so the order of the attributes is always…

javascript jquery html html-parsing

asked Oct 20 '10 at 04:19

Julien

5,729
4
37
60
