Questions tagged [html-parsing]

HTML parsing is the process of consuming a serialized HTML document and producing a representation that you can work with programmatically, for example to extract data from it. The HTML specification defines a standard parsing algorithm, which is implemented in all major browsers.

HTML parsing typically involves converting an HTML document into a tree-based Document Object Model (DOM).

The standard parsing algorithm is specified at https://html.spec.whatwg.org/multipage/parsing.html#parsing.
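
To make the "document in, tree out" idea concrete, here is a minimal sketch in Python. It uses the standard library's `html.parser` tokenizer, which is lenient but does not implement the WHATWG parsing algorithm, so the tree it builds can differ from what a browser produces for malformed markup. The `Node`, `TreeBuilder`, `parse`, and `find_all` names are hypothetical helpers made up for this illustration.

```python
from html.parser import HTMLParser

# Void elements never have an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    """A very small stand-in for a DOM element (illustrative only)."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # child Nodes and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the tokens emitted by html.parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(text)


def parse(markup):
    """Parse an HTML string and return the root of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Return every descendant element named `tag`, in document order."""
    found = []
    for child in node.children:
        if isinstance(child, Node):
            if child.tag == tag:
                found.append(child)
            found.extend(find_all(child, tag))
    return found


root = parse('<ul><li><a href="/a">One</a></li>'
             '<li><a href="/b">Two</a><br></li></ul>')
hrefs = [a.attrs["href"] for a in find_all(root, "a")]
print(hrefs)
```

For real work, a parser that follows the specification (for example a browser's own DOM APIs, or a library that documents WHATWG conformance) is usually the safer choice; this sketch only shows the shape of the result.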

5960 questions
2298
votes
31 answers

How do you parse and process HTML/XML in PHP?

How can one parse HTML/XML and extract information from it?
RobertPitt
  • 56,863
  • 21
  • 114
  • 161
402
votes
15 answers

Parse an HTML string with JS

I want to parse a string which contains HTML text. I want to do it in JavaScript. I tried the Pure JavaScript HTML Parser library but it seems that it parses the HTML of my current page, not from a string. Because when I try the code below, it…
stage
  • 4,225
  • 4
  • 15
  • 8
309
votes
4 answers

How to strip HTML tags from string in JavaScript?

How can I strip the HTML from a string in JavaScript?
f.ardelian
  • 6,716
  • 8
  • 36
  • 53
254
votes
7 answers

Parsing HTML using Python

I'm looking for an HTML Parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects. If I have a document of the form: Heading
ffledgling
  • 11,502
  • 8
  • 47
  • 69
236
votes
18 answers

Using regular expressions to parse HTML: why not?

It seems like every question on stackoverflow where the asker is using regex to grab some information from HTML will inevitably have an "answer" that says not to use regex to parse HTML. Why not? I'm aware that there are quote-unquote "real" HTML…
ntownsend
  • 7,462
  • 9
  • 38
  • 35
208
votes
3 answers

How can I efficiently parse HTML with Java?

I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation. Now, I want to separate both the tasks. I want to use a light HTML parser because it takes much time in…
Amit
  • 33,847
  • 91
  • 226
  • 299
201
votes
8 answers

What to do when a regular expression pattern doesn't match anywhere in a string?

I am trying to match of type hidden fields using this pattern: // This is some sample form data:
Salman
  • 2,119
  • 3
  • 15
  • 14
199
votes
23 answers

Regex select all text between tags

What is the best way to select all the text between 2 tags - ex: the text between all the '' tags on the page.
basheps
  • 10,034
  • 11
  • 36
  • 45
165
votes
10 answers

How to extract img src, title and alt from html using php?

I would like to create a page where all images which reside on my website are listed with title and alternative representation. I already wrote me a little program to find and load all HTML files, but now I am stuck at how to extract src, title and…
Sam
  • 28,421
  • 49
  • 167
  • 247
139
votes
0 answers

Robust and Mature HTML Parser for PHP

Are there any robust and mature HTML parsers available for PHP? A quick skimming of PEAR didn't turn anything up (lots of classes for generating HTML, not so much for consuming), and Google taught me a lot of people have started and then abandoned a…
Alana Storm
  • 164,128
  • 91
  • 395
  • 599
118
votes
8 answers

How to extract string following a pattern with grep, regex or perl

I have a file that looks something like this:
wrangler
  • 1,995
  • 2
  • 19
  • 22
109
votes
6 answers

How do I parse a HTML page with Node.js

I need to parse (server side) big amounts of HTML pages. We all agree that regexp is not the way to go here. It seems to me that javascript is the native way of parsing a HTML page, but that assumption relies on the server side code having all the…
Itay Moav -Malimovka
  • 52,579
  • 61
  • 190
  • 278
97
votes
5 answers

How do HTML parsers work if they're not using regexp?

I see questions every day asking how to parse or extract something from some HTML string and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted). This is rather confusing…
Andy E
  • 338,112
  • 86
  • 474
  • 445
95
votes
1 answer

How to get HTML from a beautiful soup object

I have the following bs4 object listing: >>> listing .... >>> type(listing) I want to extract the raw html as a string. I've tried: >>> a = listing.contents >>> type(a) So…
user1592380
  • 34,265
  • 92
  • 284
  • 515
85
votes
8 answers

How to normalize HTML in JavaScript or jQuery?

Tags can have multiple attributes. The order in which attributes appear in the code does not matter. For example: How can I "normalize" the HTML in Javascript, so the order of the attributes is always…
Julien
  • 5,729
  • 4
  • 37
  • 60