I'm trying to come up with a greedy sed expression that ignores anything inside quoted HTML attribute values and matches ONLY the text of the element.

<p alt="100">100</p> #need to match only second 100
<img src="100.jpg">100</img> #need to match only second 100
<span alt="tel:100">100</span> #need to match only second 100

These are my attempts:

grep -E '(!?\")100(!?\")' html # this matches string as well as quotes 
grep -E '[^\"]100[^\"]' html # this doesn't work either

Edit

Ok. I was trying to simplify the question but maybe that's wrong.

With the command sed -r 's/?????/__replaced__/g' file I would need to see:

<p alt="100">__replaced__</p>
<img src="100.jpg">__replaced__</img> 
<span alt="tel:100">__replaced__</span> 
Cœur
zzart

4 Answers

I don't think handling HTML with sed (or grep) is a good idea. Consider using Python, which has an HTML push parser in its standard library. This makes separating tags from data easy. Since you only want to handle the data between tags, it could look something like this:

#!/usr/bin/env python3

# Python 3: the parser lives in html.parser (it was HTMLParser in Python 2).
from html.parser import HTMLParser
from sys import argv

class MyParser(HTMLParser):
    def handle_data(self, data):
        # data is the string between tags. You can do anything you like with it.
        # For a simple example:
        if data == "100":
            print(data)

# First command line argument is the HTML file to handle.
with open(argv[1], "r") as f:
    MyParser().feed(f.read())

Update for updated question: To edit HTML with this, you'll have to implement the handle_starttag and handle_endtag methods as well as handle_data in a manner that reprints the parsed tags. For example:

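#!/usr/bin/env python3

# Python 3: the parser lives in html.parser (it was HTMLParser in Python 2).
from html.parser import HTMLParser
from sys import stdout, argv
import re

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Reprint the opening tag together with its attributes.
        stdout.write("<" + tag)
        for k, v in attrs:
            if v is None:
                # Attribute without a value, e.g. <input disabled>
                stdout.write(" " + k)
            else:
                stdout.write(' {}="{}"'.format(k, v))
        stdout.write(">")

    def handle_endtag(self, tag):
        stdout.write("</{}>".format(tag))

    def handle_data(self, data):
        # Only the text between tags is changed; attribute values are untouched.
        data = re.sub("100", "__replaced__", data)
        stdout.write(data)

# First command line argument is the HTML file to handle.
with open(argv[1], "r") as f:
    MyParser().feed(f.read())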
Wintermute

A warning first: parsing HTML with regular expressions is generally a bad idea, and the usual answer is to use an HTML parser instead. Most scripting languages (Perl, Python, etc.) have one.

See here for an example as to why: RegEx match open tags except XHTML self-contained tags

If you really must though:

/(?!\>)([^<>]+)(?=\<)/
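
The lookaround idea can be tried out in Python's re module (sed itself has no lookaround). The sketch below is not the pattern above verbatim: it assumes a lookbehind, (?<=>), in place of (?!\>), so that neither angle bracket ends up in the match. The names TEXT_NODE and text_nodes are made up for illustration.

```python
import re

# Assumed variant of the pattern: a lookbehind for ">" and a lookahead for "<",
# so only the text between the two tags is consumed by the match.
TEXT_NODE = re.compile(r'(?<=>)[^<>]+(?=<)')

def text_nodes(html):
    """Return every run of text that sits between a '>' and the next '<'."""
    return TEXT_NODE.findall(html)

print(text_nodes('<p alt="100">100</p>'))
# prints: ['100']
print(TEXT_NODE.sub('__replaced__', '<span alt="tel:100">100</span>'))
# prints: <span alt="tel:100">__replaced__</span>
```

Like any regex approach, this breaks as soon as an attribute value contains a literal > or <, which is the reason for the warning above.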

Sobrique
  • Thanks, but this matches >100< and I need to be accurate and match only the number, without any other characters. – zzart Jul 03 '15 at 12:17

You could try one of the PCRE regexes below.

grep -oP '"[^"]*100[^"]*"(*SKIP)(*F)|\b100\b' file

or

grep -oP '"[^"]*"(*SKIP)(*F)|\b100\b' file

This matches the number 100 only where it does not appear inside double quotes.
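
The (*SKIP)(*F) verbs are PCRE-specific, so they are not available in sed or in Python's re module. A common way to get the same effect, sketched here under the assumption that a quoted string never contains an escaped double quote, is to let one alternative swallow the quoted string and capture the wanted text in the other. The names PATTERN and replace_unquoted are hypothetical.

```python
import re

# The left alternative swallows a whole double-quoted string; the right one
# captures a standalone 100. Group 1 is filled only by the right alternative.
PATTERN = re.compile(r'"[^"]*"|\b(100)\b')

def replace_unquoted(line, replacement='__replaced__'):
    """Replace 100 outside double quotes and leave quoted strings untouched."""
    def swap(match):
        # group(1) is None when the quoted-string alternative matched,
        # in which case the original text is put back unchanged.
        return replacement if match.group(1) is not None else match.group(0)
    return PATTERN.sub(swap, line)

print(replace_unquoted('<img src="100.jpg">100</img>'))
# prints: <img src="100.jpg">__replaced__</img>
```

Because the quoted string is consumed as a single match, the 100 inside it is never seen by the second alternative, which is the job (*SKIP)(*F) does in the grep commands.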

Avinash Raj
  • I don't think that's sed-compatible. sed: -e expression #1, char 35: Invalid preceding regular expression – zzart Jul 03 '15 at 12:25

Your question's gotten kinda muddy through its evolution, but is this what you're asking for?

$ sed -r 's/>[^<]+</>__replaced__</' file
<p alt="100">__replaced__</p> #need to match only second 100
<img src="100.jpg">__replaced__</img> #need to match only second 100
<span alt="tel:100">__replaced__</span> #need to match only second 100

If not, please clean up your question to show just the latest sample input, the expected output, and an explanation.
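
The sed command above replaces whatever text sits between > and <. If only a standalone 100 inside that text should change, the same pattern shape can be narrowed, shown here as a Python sketch. It assumes no attribute value contains a literal > or <, and the names TEXT_RUN, NUMBER and replace_in_text are invented for the example.

```python
import re

TEXT_RUN = re.compile(r'>([^<]+)<')   # same shape as the sed pattern above
NUMBER = re.compile(r'\b100\b')       # assumed target: a standalone 100

def replace_in_text(line, replacement='__replaced__'):
    """Replace 100 only inside the text found between '>' and '<'."""
    def swap(match):
        # Rewrite the captured text and put the angle brackets back around it.
        return '>' + NUMBER.sub(replacement, match.group(1)) + '<'
    return TEXT_RUN.sub(swap, line)

print(replace_in_text('<p alt="100">100</p> #need to match only second 100'))
# prints: <p alt="100">__replaced__</p> #need to match only second 100
```

Other text in the element survives: a line such as <p alt="100">call 100 now</p> keeps "call" and "now" and only the number is swapped.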

Ed Morton