sed don't match characters inside parenthesis

Question

I'm trying to come up with a greedy sed expression that ignores everything inside HTML attribute quotes and matches ONLY the text of the element.

<p alt="100">100</p> #need to match only second 100
<img src="100.jpg">100</img> #need to match only second 100
<span alt="tel:100">100</span> #need to match only second 100

These are my attempts:

grep -E '(!?\")100(!?\")' html # this matches string as well as quotes 
grep -E '[^\"]100[^\"]' html # this doesn't work either

Edit

Ok. I was trying to simplify the question but maybe that's wrong.

With the command sed -r 's/?????/__replaced__/g' file, I would need to see:

<p alt="100">__replaced__</p>
<img src="100.jpg">__replaced__</img> 
<span alt="tel:100">__replaced__</span>

Are you sure your second example doesn't work?: https://regex101.com/r/qG6gX8/1 — Sobrique, Jul 03 '15 at 12:13
Hi @fedorqui, I've slightly updated my question so it's clearer, but my requirement is that the same number is replaced on every single line. If I had made it 100, 200, 300, that would change the question. — zzart, Jul 03 '15 at 13:00

Wintermute · Accepted Answer · 2015-07-03T13:07:43.727

I don't think handling HTML with sed (or grep) is a good idea. Consider using Python, which has an HTML push parser in its standard library. This makes separating tags from data easy. Since you only want to handle the data between tags, it could look something like this:

#!/usr/bin/env python3

# In Python 3 the parser lives in html.parser (it was HTMLParser in Python 2).
from html.parser import HTMLParser
from sys import argv

class MyParser(HTMLParser):
    def handle_data(self, data):
        # data is the string between tags. You can do anything you like with it.
        # For a simple example:
        if data == "100":
            print(data)

# First command line argument is the HTML file to handle.
with open(argv[1], "r") as f:
    MyParser().feed(f.read())

Update for updated question: To edit HTML with this, you'll have to implement the handle_starttag and handle_endtag methods as well as handle_data in a manner that reprints the parsed tags. For example:

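#!/usr/bin/env python3

# In Python 3 the parser lives in html.parser (it was HTMLParser in Python 2).
from html.parser import HTMLParser
from sys import stdout, argv
import re

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Reprint the opening tag together with its attributes.
        stdout.write("<" + tag)
        for k, v in attrs:
            if v is None:
                # Attribute without a value, e.g. "disabled".
                stdout.write(" " + k)
            else:
                stdout.write(' {}="{}"'.format(k, v))
        stdout.write(">")

    def handle_endtag(self, tag):
        # Reprint the closing tag.
        stdout.write("</{}>".format(tag))

    def handle_data(self, data):
        # Only the text between tags is changed; attribute values never get here.
        data = re.sub("100", "__replaced__", data)
        stdout.write(data)

# First command line argument is the HTML file to handle.
with open(argv[1], "r") as f:
    MyParser().feed(f.read())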
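
If you would rather get the result back as a string than have it written to stdout, something along these lines might work. This is only a sketch under a few assumptions: it targets Python 3, the names TextReplacer and replace_text are made up for illustration, and comments and doctype declarations are dropped because their handlers are not overridden. It uses get_starttag_text() so that start tags are passed through exactly as they were written in the input.

```python
from html.parser import HTMLParser


class TextReplacer(HTMLParser):
    """Re-emit HTML, changing only the text found between tags."""

    def __init__(self, old, new):
        # convert_charrefs=False routes entities such as &amp; to their own
        # handlers, so they can be passed through untouched.
        super().__init__(convert_charrefs=False)
        self.old = old
        self.new = new
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as written in the source.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are also copied verbatim.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Attribute values never reach this method, only text content.
        self.parts.append(data.replace(self.old, self.new))

    def handle_entityref(self, name):
        self.parts.append("&{};".format(name))

    def handle_charref(self, name):
        self.parts.append("&#{};".format(name))


def replace_text(html, old="100", new="__replaced__"):
    """Return html with `old` replaced by `new` in text content only."""
    parser = TextReplacer(old, new)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(replace_text('<span alt="tel:100">100</span>'))
```

Calling replace_text() on each line of the file (or on the whole file at once) leaves the attribute values alone, because only handle_data ever sees the replacement.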

You are right. After evaluating the problem with a bigger sample set, it looks like I can't get away with a simple regex. — zzart, Jul 06 '15 at 07:22
Yes. Regex can't parse HTML. It can fake it in small subsets. — Sobrique, Jul 06 '15 at 08:24

score 2 · Answer 2 · edited May 23 '17 at 12:06

A warning first: parsing HTML with regular expressions is generally a bad idea, and the usual answer is to use an HTML parser. Most scripting languages (Perl, Python, etc.) have one.

For an example of why, see the question "RegEx match open tags except XHTML self-contained tags".

If you really must though:

/(?!\>)([^<>]+)(?=\<)/
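
sed has no lookaround support, so a pattern like this needs a PCRE-style engine. As a rough illustration in Python, a lookbehind/lookahead pair can pin the match to a literal 100 sitting directly between ">" and "<", so only the number itself is replaced. The pattern and the name replace_between_tags are assumptions made for this sketch, and it is still a fragile shortcut: an attribute value containing ">100<" would be rewritten as well.

```python
import re

# Assumed pattern: a literal 100 directly preceded by ">" and followed by "<".
# The lookarounds assert the delimiters without consuming them.
TEXT_100 = re.compile(r"(?<=>)100(?=<)")


def replace_between_tags(line, new="__replaced__"):
    """Replace 100 only where it is the entire text between two tags."""
    return TEXT_100.sub(new, line)


if __name__ == "__main__":
    for sample in ('<p alt="100">100</p>',
                   '<img src="100.jpg">100</img>',
                   '<span alt="tel:100">100</span>'):
        print(replace_between_tags(sample))
```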

answered Jul 03 '15 at 12:06 by Sobrique

Thanks, but this matches >100< and I need to be accurate and match only the number, without any other characters. – zzart Jul 03 '15 at 12:17

score 1 · Answer 3 · answered Jul 03 '15 at 12:19

You may try the below PCRE regex.

grep -oP '"[^"]*100[^"]*"(*SKIP)(*F)|\b100\b' file

or

grep -oP '"[^"]*"(*SKIP)(*F)|\b100\b' file

This matches the number 100 only where it does not appear inside double quotes.
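
The (*SKIP)(*F) verbs are PCRE-specific, so neither sed nor Python's re module accepts them. A similar effect can be had in Python with the "match what you want to skip, then decide in a callback" idiom. The following is just a sketch of that idea, and the function name replace_outside_quotes is invented for the example.

```python
import re

# Left branch: a whole double-quoted string (to be kept as is).
# Right branch: the standalone number 100 (to be replaced).
QUOTED_OR_100 = re.compile(r'"[^"]*"|\b100\b')


def replace_outside_quotes(line, new="__replaced__"):
    """Replace 100 everywhere except inside double-quoted strings."""
    def keep_or_replace(match):
        text = match.group(0)
        # Quoted strings are returned unchanged, which mimics (*SKIP)(*F).
        return text if text.startswith('"') else new
    return QUOTED_OR_100.sub(keep_or_replace, line)


if __name__ == "__main__":
    print(replace_outside_quotes('<img src="100.jpg">100</img>'))
```

Because the quoted branch comes first and consumes the whole string, a 100 inside an attribute value is never offered to the second branch.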

answered by Avinash Raj

I don't think that's sed-compatible. sed: -e expression #1, char 35: Invalid preceding regular expression – zzart Jul 03 '15 at 12:25

score 0 · Answer 4 · answered Jul 03 '15 at 14:59

Your question has gotten kinda muddy through its evolution, but is this what you're asking for?

$ sed -r 's/>[^<]+</>__replaced__</' file
<p alt="100">__replaced__</p> #need to match only second 100
<img src="100.jpg">__replaced__</img> #need to match only second 100
<span alt="tel:100">__replaced__</span> #need to match only second 100

If not, please clean up your question to show just the latest sample input, the expected output, and an explanation.
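
For comparison, here is roughly what that sed substitution does, written as a Python sketch (the function name replace_first_text is made up for this example). Note that it replaces the first run of text between ">" and "<" whatever that text is, so it does not check that the content is actually 100.

```python
import re


def replace_first_text(line, new="__replaced__"):
    """Mimic sed 's/>[^<]+</>__replaced__</': first match per line only."""
    # count=1 mirrors the missing g flag in the sed command.
    return re.sub(r">[^<]+<", ">" + new + "<", line, count=1)


if __name__ == "__main__":
    print(replace_first_text('<p alt="100">100</p>'))
    # The text is replaced even when it is not 100:
    print(replace_first_text('<p alt="100">200</p>'))
```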
