I have an HTML file that contains the following ALT attribute:

alt="Hello I am <nobr>Please&nbsp;replace&nbsp;me</nobr> and I'm cool"

I need to use SED in a bash script to replace the above line with:

alt="Hello I am Please replace me and I'm cool"

How do I target only the tags inside an alt attribute?

3 Answers

If you are OK with awk, try the following (the strings you want to substitute are passed in as awk variables):

awk -v val="<nobr>" -v val1="&nbsp;" -v val2="</nobr>" '
/alt=/{
  gsub(val,"")     # drop the opening tag
  gsub(val1," ")   # turn each &nbsp; into a plain space
  gsub(val2,"")    # drop the closing tag
}
1'  Input_file

OR

awk -v val="<nobr>" -v val1="&nbsp;" -v val2="</nobr>" '
/alt=/{
  gsub(val"|"val2,"")   # drop both tags in one pass
  gsub(val1," ")        # turn each &nbsp; into a plain space
}
1'  Input_file

Append > temp_file && mv temp_file Input_file to either command above if you want to save the changes back into Input_file itself.
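As a rough sketch of that temp-file step, the awk call and the move can be wrapped in one small bash function. The function name clean_alt_in_place is hypothetical, and the use of mktemp instead of a fixed temp_file name is an assumption made here so that two runs do not clobber each other:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the answer): rewrite one file in place.
clean_alt_in_place() {
  local file=$1 tmp
  tmp=$(mktemp) || return 1
  # On lines containing alt=, drop the <nobr> tags and turn each &nbsp;
  # into a space; the trailing 1 prints every line, changed or not.
  awk '/alt=/ { gsub(/<\/?nobr>/, ""); gsub(/&nbsp;/, " ") } 1' "$file" > "$tmp" \
    && mv "$tmp" "$file"
}

# Usage: clean_alt_in_place Input_file
```

Note that mv replaces the original file with the temporary one, so the result carries the temp file's permissions rather than the original's.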

RavinderSingh13

A sed answer would be:

 sed -E '/alt=/{:a s/(<nobr>)(.*)&nbsp;(.*)(<\/nobr>)/\1\2 \3\4/;ta; s/<nobr>(.*)<\/nobr>/\1/}'

Explanation:

  • /alt=/ restricts the commands in braces to lines containing alt=
  • :a defines a label named a
  • s/(<nobr>)(.*)&nbsp;(.*)(<\/nobr>)/\1\2 \3\4/ replaces one &nbsp; between the tags with a space
  • ta repeats if the substitution was successful; that is, it jumps back to a
  • s/<nobr>(.*)<\/nobr>/\1/ finally removes the <nobr> and </nobr>
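To make the steps above easier to follow, here is the same loop written with one sed command per line, which also avoids differences between sed implementations in how one-line labels are parsed. This is only a sketch: the function name unwrap_nobr is hypothetical, and the two capture groups are a slight simplification of the four used in the original command.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: reads HTML on stdin (or from file arguments).
unwrap_nobr() {
  sed -E '/alt=/ {
    :a
    s/(<nobr>.*)&nbsp;(.*<\/nobr>)/\1 \2/
    ta
    s/<nobr>(.*)<\/nobr>/\1/
  }' "$@"
}

# Usage: unwrap_nobr < file.html
```

Each pass through the loop converts a single &nbsp; that sits between the tags; an &nbsp; elsewhere on the line is left alone because the pattern requires <nobr> before it and </nobr> after it.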

Added: Because sed's regex matching is greedy, this script will fail if there are two </nobr>'s in the line. While there are workarounds (see ishahak's answer to "Non greedy (reluctant) regex matching in sed?"), it becomes a pain.
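One common way around the greediness, shown here only as a sketch and not necessarily the approach in the linked answer, is to replace .* with the negated class [^<]* so that a match can never run past the next tag. It assumes there is no < character and no nested tag inside a <nobr> span; the function name unwrap_nobr_spans is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical variant that copes with several <nobr> spans on one line.
unwrap_nobr_spans() {
  sed -E '/alt=/ {
    :a
    s/(<nobr>[^<]*)&nbsp;([^<]*<\/nobr>)/\1 \2/
    ta
    s/<nobr>([^<]*)<\/nobr>/\1/g
  }' "$@"
}

# Usage: unwrap_nobr_spans < file.html
```

With two spans on a line, the greedy version would also convert entities lying between the spans and strip only the outermost pair of tags; this variant handles each span separately.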

In any case, this answer is already overkill as it is, since the OP found that a much simpler solution sufficed for their needs; see comment below.

Joseph Quinsey
  • Since I wanted to remove all the nobr tags and the nbsps, I went this route: `sed -i "/alt=/{ s|<nobr>||g; s|</nobr>||g; s|&nbsp;| |g; }" "$projectPath/$htmlfile"` – Alexandru Popovici Aug 03 '18 at 18:14

Here's a hamfisted way of doing it:

% sed $'s#alt="Hello I am <nobr>Please&nbsp;replace&nbsp;me</nobr> and I\'m cool"#alt="Hello I am Please replace me and I\'m cool"#' < file.html

My suggestion would be to not parse HTML using shell tools - it will only lead to tears and frustration. Use Python's BeautifulSoup module instead.

keithpjolley
  • btw, see this answer for what that extraneous '$' is doing in there. https://stackoverflow.com/questions/8254120/how-to-escape-a-single-quote-in-single-quote-string-in-bash – keithpjolley Aug 03 '18 at 17:45