
I want to replace the value inside a tag with an equal number of X characters. For example:

1.

<Name> Jason </Name>
to
<Name> XXXXX </Name>

2. (see no space)

 <Name>Jim</Name>
 to
 <Name>XXX</Name>

3.

<Name Jason /> 
to 
<Name XXXXX />

4.

<Name Jas />
to
<Name XXX />

The starting tag, the value, and the closing tag can each be on a different line:

5.

<Name>Jim
</Name>
to
<Name>XXX
</Name>

6.

<Name>
     Jim
       </Name>
to
<Name>
     XXX
       </Name>

7.

  <Name
     Jim
       />
to
  <Name
     XXX
       />

8.

<Name> Jason </Name> <Name> Ignacio </Name>
to
<Name> XXXXX </Name> <Name> XXXXXXX </Name>

9.

<Name> Jason Ignacio </Name>
to
<Name> XXXXX XXXXXXX </Name>
or
<Name> XXXXXXXXXXXXX </Name>

both are fine

I tried this, but it didn't work:

file=mylog.log
search_str="<Name>"
end_str="</Name>"
sed -i -E ':a; s/('"$search_str"'X*)[^X'"$end_str"']/\1X/; ta' "$file"

Please let me know how to do this in a bash script.

Update:

I also tried the following. Cases 1 to 5 worked, but cases 6 and 7 did not:

sed -i -E '/<Name>/{:a; /<\/Name>/bb; n; ba; :b; s/(<Name>X*)[^X\<]/\1X/; tb; }' "$file"
sed -i -E '/<Name[[:space:]]/{:a; /\/>/bb; n; ba; :b; s/(<Name[[:space:]]X*)[^X\/]/\1X/; tb; }' "$file"
Puneet Jain
  • Suggest using an `xml` aware tool (or) tools which are able to parse `` elements. `sed` or `awk` is not the best way for the same – Inian Aug 12 '16 at 06:37
  • It can be done, but it is going to be painful for each case. I'm not willing to undergo the pain involved in producing the answer. Cases 3 & 4 look to be identical. Cases 1 & 2 can be handled with a regex along the lines of `s/\([[:space:]]*X*\)\([^X[:space:]]\)\([[:space:]]*<\/Name>\)/\1X\2/` and some `sed` hackery to iterate until there are no substitutions (a label and a test/branch operation). Note that the mechanism shown won't deal with `<Name> Jason Bourne </Name>`; you have some changes to make (easy ones, as it happens). Have fun. – Jonathan Leffler Aug 12 '16 at 06:38
  • Incidentally, if you can have `<Name> Jason Bourne </Name>`, should the output be `<Name> XXXXX XXXXXX </Name>` or `<Name> XXXXXXXXXXXX </Name>`? – Jonathan Leffler Aug 12 '16 at 06:43
  • @JonathanLeffler since given `v="Jason"` and `echo "${v//?/X}"` returns XXXXX, wouldn't using `sed` to perform this parameter expansion like in [here](http://stackoverflow.com/a/34080390/1983854) help? I find it difficult to use parameter expansion against a captured group, though. – fedorqui Aug 12 '16 at 06:46
  • @fedorqui: I'm guessing that the names are not just Jason, Jim, Jas, in general. And a single chunk of XML could have a myriad different names; I'm envisaging XML output of a customer table with names masked, for example. So I think anything that uses shell to find a value and do the substitution is going to be painful, but the whole exercise is going to be painful regardless. You really need an XML parser (Perl, Python, ...) and to process it that way. – Jonathan Leffler Aug 12 '16 at 06:49
  • @Inian The problem is, `` is not valid XML. – Michael Vehrs Aug 12 '16 at 06:54
  • My regex was a bit off. `sed -e ': l1' -e 's/\([[:space:]]*X*\)\([^X[:space:]]\)\(.*[[:space:]]*<\/Name>\)/\1X\3/' -e 't l1'` deals with single word names where the start and end tag are on the same line. It does handle the case where there are two complete name entries on a single line. I take it back; adapting that to handle multi-word names is not straight-forward, regardless of whether embedded blanks should be retained or mapped to `X` too. You probably need code around that to detect that you've got the single line case so you can try the multi-line cases separately. – Jonathan Leffler Aug 12 '16 at 07:00
  • If you need to handle a line with case 1 and, say, case 6 starting on the same line, things get still more fraught. Any regex-based solution will be fragile — only a proper parsing solution is going to be reliable. (See [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](http://stackoverflow.com/questions/701166/)) – Jonathan Leffler Aug 12 '16 at 07:04
  • @JonathanLeffler Multi word with in a tag is fine. "Mike Author" can be replaced either by "XXXX XXXXXX" or by "XXXXXXXXXXX". does not really matter. – Puneet Jain Aug 12 '16 at 07:28
  • @John1024 Adding john to get his inputs !! – Puneet Jain Aug 15 '16 at 22:37

2 Answers


Provisional solution

This extends the 'initial offering' below and handles cases 1, 2, 5, 6, 8, 9. It does not handle the case where there are one or more complete <Name>…</Name> entries and also a starting <Name> without the matching </Name> on the same line. Frankly, I'm not even sure how to start tackling that scenario.

The unhandled cases 3, 4, 7 are not valid XML — I'm not convinced they're valid HTML (or XHTML) either. I believe they can be handled by a similar (but simpler) mechanism to the one shown here for the full <Name>…</Name> version. I'm leaving that as an exercise for the reader (beware the < in the character class — it would need to become a /).
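
For comparison only, here is a minimal Python sketch of that "similar but simpler mechanism" for the <Name value /> forms (cases 3, 4, 7). It is not the sed exercise itself. The names PSEUDO_TAG and mask_pseudo_tags are made up for illustration, and it assumes the value never contains < or >. Note that it would also mask the attributes of a genuine self-closing element such as <Name id="1" />.

```python
import re

# "<Name", whitespace, the value (lazy), optional whitespace, then "/>".
# [^<>] also matches newlines, so the multi-line layout of case 7 is covered.
PSEUDO_TAG = re.compile(r"(<Name\s+)([^<>]*?)(\s*/>)")

def mask_pseudo_tags(text):
    """Replace each non-whitespace character of the value with X."""
    def repl(m):
        return m.group(1) + re.sub(r"\S", "X", m.group(2)) + m.group(3)
    return PSEUDO_TAG.sub(repl, text)

print(mask_pseudo_tags("<Name Jason />"))
print(mask_pseudo_tags("  <Name\n     Jim\n       />"))
```

Lines containing only the regular <Name>…</Name> form are left untouched by this pattern, because it requires whitespace directly after <Name.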

script.sed

/<Name>/! b
/<Name>.*<\/Name>/{
: l1
s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l1
b
}
/<Name>/,/<\/Name>/{
  # Handle up to 4 lines to the end-name tag
  /<\/Name>/! N
  /<\/Name>/! N
  /<\/Name>/! N
  /<\/Name>/! N
# s/^/ZZ/; s/$/AA/p
# s/^ZZ//; s/AA$//
  : l2
  s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
  t l2
}

The first line 'skips' processing of lines not containing <Name> (they get printed and the next line is read). The next 6 lines are the script from the 'initial offering' except that there's a b to jump to the end of processing.

The new section is the /<Name>/,/<\/Name>/ code. This looks for <Name> on its own, and concatenates up to 4 lines until a </Name> is included in the pattern space. The two comment lines were used for debugging — they allowed me to see what was being treated as a unit. Except for the use of the label l2 in place of l1, the remainder is exactly the same as in the initial offering — sed regexes already accommodate newlines.
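
The line-joining idea can be sketched in Python as well. This is only an illustration under assumptions, not a translation of the sed script. The helper names mask_chunk and mask_lines are hypothetical, and the masking is done with one callback substitution instead of the character-by-character loop. Like the sed script, it joins at most 4 extra lines and does not cope with a complete entry plus an unclosed <Name> on the same line.

```python
import re

# DOTALL lets "." cross the newlines that were joined into one chunk.
NAME_BLOCK = re.compile(r"(<Name>)(.*?)(</Name>)", re.DOTALL)

def mask_chunk(chunk):
    """Replace every non-whitespace character inside <Name>...</Name> with X."""
    def repl(m):
        return m.group(1) + re.sub(r"\S", "X", m.group(2)) + m.group(3)
    return NAME_BLOCK.sub(repl, chunk)

def mask_lines(lines, max_extra=4):
    """Join up to max_extra following lines while </Name> is still missing."""
    out = []
    it = iter(lines)
    for chunk in it:
        extra = 0
        while "<Name>" in chunk and "</Name>" not in chunk and extra < max_extra:
            nxt = next(it, None)
            if nxt is None:  # input ended before a closing tag appeared
                break
            chunk += "\n" + nxt  # same role as sed's N command
            extra += 1
        out.append(mask_chunk(chunk))
    return "\n".join(out)

print(mask_lines(["<Name>", "     Jim", "       </Name>"]))
```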

This is heavy-duty sed scripting and not what I'd want to use or maintain. I would go with a Perl solution using an XML parser (because I know Perl better than Python), but Python would do the job fine too with an appropriate XML parser.
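
As a rough idea of what the parser route might look like in Python, here is a sketch using only the standard library's xml.etree.ElementTree. It assumes the input is well-formed XML with a single root element, so the <Name value /> forms of cases 3, 4, 7 would have to be handled separately. It masks only the text directly inside each Name element. The function name mask_names and the sample document are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

def mask_names(xml_text, tag="Name"):
    """Replace non-whitespace characters in the text of each <Name> element."""
    root = ET.fromstring(xml_text)
    for elem in root.iter(tag):
        if elem.text:  # skip empty elements
            elem.text = re.sub(r"\S", "X", elem.text)
    return ET.tostring(root, encoding="unicode")

sample = "<page><p>Jason</p><Name> Jason </Name><Name>\n  Jim\n</Name></page>"
print(mask_names(sample))
```

Because the parser hands over the element text as one string, line breaks inside an element need no special treatment.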

data

A slightly extended data file.

<Name> Jason </Name>
<Name>Jim</Name>
<Name> Jason Bourne </Name>
<Name> Elijah </Name> <Name> Dennis </Name>
<Name> Elijah Wood </Name> <Name> Dennis The Menace </Name>
<Name>Elijah Wood</Name> <Name>Dennis The Menace</Name>
<Name> Jason
        </Name>
<Name>
    Jim</Name>
<Name>
    Jim
        </Name>
<Name> Jason
Bourne </Name>
<Name> 
    Jason
        Bourne
            </Name>
<Name> Elijah </Name>
<Name>
Dennis
</Name>
<Name> Elijah
Wood </Name>
            <Name> Dennis
The Menace </Name>
<Name>Elijah
Wood</Name>
    <Name>Dennis The
Menace</Name>



<Name> Jason </Name>
to
<Name> XXXXX </Name>

2. (see no space)

 <Name>Jim</Name>
 to
 <Name>XXX</Name>

3.

<!--Name Jason /--> 
to 
<!--Name XXXXX /-->`

4.

<!--Name Jas /-->
to
<!--Name XXX /-->

starting tag, value and closing tag can all come in different line

5.

<Name>Jim
</Name>
to
<Name>XXX
</Name>

6.

<Name>
     Jim
       </Name>
to
<Name>
     XXX
       </Name>

7.

  <!--Name
     Jim
       /-->
to
  <!--Name
     XXX
       /-->

8.

<Name> Jason </Name> <Name> Ignacio </Name>
to
<Name> XXXXX </Name> <Name> XXXXXX </Name>

9.

<Name> Jason Ignacio </Name>
to
<Name> XXXXX XXXXXXX </Name>
or
<Name> XXXXXXXXXXXXX </Name>

No claims are made that the data file contains a minimal set of cases; it is repetitious. It includes the material from the question, except that the 'unorthodox' XML elements like <Name Value /> are converted into XML comments <!--Name Value /-->. The mapping actually isn't crucial; the opening part doesn't match <Name> (and the tail doesn't match </Name>) so they'd not be processed anyway.

Output

$ sed -f script.sed data
<Name> XXXXX </Name>
<Name>XXX</Name>
<Name> XXXXX XXXXXX </Name>
<Name> XXXXXX </Name> <Name> XXXXXX </Name>
<Name> XXXXXX XXXX </Name> <Name> XXXXXX XXX XXXXXX </Name>
<Name>XXXXXX XXXX</Name> <Name>XXXXXX XXX XXXXXX</Name>
<Name> XXXXX
        </Name>
<Name>
    XXX</Name>
<Name>
    XXX
        </Name>
<Name> XXXXX
XXXXXX </Name>
<Name> 
    XXXXX
        XXXXXX
            </Name>
<Name> XXXXXX </Name>
<Name>
XXXXXX
</Name>
<Name> XXXXXX
XXXX </Name>
            <Name> XXXXXX
XXX XXXXXX </Name>
<Name>XXXXXX
XXXX</Name>
    <Name>XXXXXX XXX
XXXXXX</Name>



<Name> XXXXX </Name>
to
<Name> XXXXX </Name>

2. (see no space)

 <Name>XXX</Name>
 to
 <Name>XXX</Name>

3.

<!--Name Jason /--> 
to 
<!--Name XXXXX /-->`

4.

<!--Name Jas /-->
to
<!--Name XXX /-->

starting tag, value and closing tag can all come in different line

5.

<Name>XXX
</Name>
to
<Name>XXX
</Name>

6.

<Name>
     XXX
       </Name>
to
<Name>
     XXX
       </Name>

7.

  <!--Name
     Jim
       /-->
to
  <!--Name
     XXX
       /-->

8.

<Name> XXXXX </Name> <Name> XXXXXXX </Name>
to
<Name> XXXXX </Name> <Name> XXXXXX </Name>

9.

<Name> XXXXX XXXXXXX </Name>
to
<Name> XXXXX XXXXXXX </Name>
or
<Name> XXXXXXXXXXXXX </Name>
$

Initial offering

A partial answer — but it illustrates the problems you face. Dealing with cases 1 & 2 in the question, plus the multi-word variations, you can use the script:

script.sed

/<Name>.*<\/Name>/{
: l1
s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l1
}

That is pretty contorted, to be polite about it. It looks for <Name> followed by zero or more spaces. That can be followed by \(X[X[[:space:]]*\)\{0,1\}, which means zero or one occurrences of an X followed by a sequence of X's or spaces. All of that is captured as \1 in the replacement. Then there's a single character that isn't an X, < or space, followed by zero or more any characters, zero or more spaces, and </Name>. The single character in the middle is replaced by an X. The whole replacement is repeated until there are no more matches via the label : l1 and the conditional branch t l1. All that operates only on a line with both <Name> and </Name>.
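
To make the loop easier to follow, here is a loose Python rendering of the same idea: substitute one character per pass and stop when nothing changes. It is a sketch, not an exact port. POSIX and Python regexes differ in detail, the bracket expression is simplified to X's and whitespace, and the names STEP and mask_line are invented for this illustration.

```python
import re

# Group 1: "<Name>", optional whitespace, and any X's/spaces already written.
# Then one character still to be masked (not X, "<" or whitespace).
# Group 2: the rest of the line up to and including a "</Name>".
STEP = re.compile(r"(<Name>\s*(?:X[X\s]*)?)[^X<\s](.*\s*</Name>)")

def mask_line(line):
    """Replace one character per pass until no substitution happens."""
    while True:
        line, count = STEP.subn(r"\g<1>X\g<2>", line, count=1)
        if count == 0:  # same role as falling through 't l1' in sed
            return line

for sample in ["<Name> Jason </Name>",
               "<Name> Elijah </Name> <Name> Dennis </Name>",
               "<Name> Jason"]:
    print(mask_line(sample))
```

A line with an opening <Name> but no </Name> comes back unchanged, which mirrors the limitation described above.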

data

<Name> Jason </Name>
<Name>Jim</Name>
<Name> Jason Bourne </Name>
<Name> Elijah </Name> <Name> Dennis </Name>
<Name> Elijah Wood </Name> <Name> Dennis The Menace </Name>
<Name>Elijah Wood</Name> <Name>Dennis The Menace</Name>
<Name> Jason
</Name>
<Name>
Jim</Name>
<Name> Jason
Bourne </Name>
<Name> Elijah </Name> <Name> Dennis
</Name>
<Name> Elijah
Wood </Name> <Name> Dennis
The Menace </Name>
<Name>Elijah
Wood</Name> <Name>Dennis The
Menace</Name>

Output

$ sed -f script.sed data
<Name> XXXXX </Name>
<Name>XXX</Name>
<Name> XXXXX XXXXXX </Name>
<Name> XXXXXX </Name> <Name> XXXXXX </Name>
<Name> XXXXXX XXXX </Name> <Name> XXXXXX XXX XXXXXX </Name>
<Name>XXXXXX XXXX</Name> <Name>XXXXXX XXX XXXXXX</Name>
<Name> Jason
</Name>
<Name>
Jim</Name>
<Name> Jason
Bourne </Name>
<Name> XXXXXX </Name> <Name> Dennis
</Name>
<Name> Elijah
Wood </Name> <Name> Dennis
The Menace </Name>
<Name>Elijah
Wood</Name> <Name>Dennis The
Menace</Name>
$

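Note the partial replacement in the line near the end that holds a complete <Name> Elijah </Name> entry followed by an opening <Name> Dennis whose </Name> is on the next line. That line is going to cause headaches for anything more.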

I've not worked out how the script would handle the various split-line cases, beyond noting that it would almost certainly need to join lines until the </Name> is caught. It would then do processing closely related to that already shown, but it would need to allow for newlines in the matched material.

Jonathan Leffler

Try this python script:

$ cat script.py
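#!/usr/bin/python
import re
from bs4 import BeautifulSoup

# Parse the input file as XML (the "xml" feature relies on lxml).
with open('cases') as f:
    soup = BeautifulSoup(f, features="xml")
# Mask only the listed names, and only inside <Name> elements.
for tag in soup.find_all('Name'):
    for name in 'Jason Ignacio', 'Jason', 'Jim':
        tag.string = re.sub(r'\b%s\b' % name, len(name)*'X', tag.string)
print(str(soup))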

This code is compatible with either python2 or python3.

To make it work, you may need to install the BeautifulSoup module (features="xml" also relies on lxml being available). On a Debian-like system:

apt-get install python-bs4

Or, for python3:

apt-get install python3-bs4

Example

Let's consider this input file:

$ cat cases
<page>
<p>Jason</p>
<Name> Jason </Name>
<p>Jason</p>
 <Name>Jim</Name>
<p>Jim</p>
<Name>Jim
</Name>
<Name>
     Jim
       </Name>
<Name> Jason </Name> <Name> Ignacio </Name>
<Name> Jason Ignacio </Name>
</page>

Let's run our script and observe the output:

$ python script.py
<?xml version="1.0" encoding="utf-8"?>
<page>
<p>Jason</p>
<Name> XXXXX </Name>
<p>Jason</p>
<Name>XXX</Name>
<p>Jim</p>
<Name>XXX
</Name>
<Name>
     XXX
       </Name>
<Name> XXXXX </Name> <Name> Ignacio </Name>
<Name> XXXXXXXXXXXXX </Name>
</page>

Note that the names in <p> tags are left alone. The code only changes the names in <Name> tags.

Also, as per the design, Jim, Jason, and Jason Ignacio are changed to X's but other names are left alone. Even Ignacio, if it appears without an adjacent Jason, is left alone.

John1024
  • Thanks @John1024. This is exactly the result I want. However, there is no Python on our server, only shell/bash. Would you mind giving me an equivalent sed command, or maybe telling me what's wrong with my sed command? – Puneet Jain Aug 20 '16 at 09:09
  • Hey john @John1024 .. Do you know how to fix case 6/7 using sed command? – Puneet Jain Aug 25 '16 at 14:13
  • @PuneetJain Would you clarify one point: are you looking to change only specific names? Or do you want to change all alphabetic characters that occur inside the Name tags? – John1024 Aug 25 '16 at 19:37
  • @John1024 Anything inside the Name tag. Name is just one word shown here. In reality I would be using a variable whose value comes from a name array like (John, Jim, Carry, Marry, SSN, Dude), etc. Currently I am using two sed commands: `sed -i -E ':a; s/('"$search_str1"'X*)[^X\<]/\1X/; ta' "$newfile"` and `sed -i -E ':a; s/('"$search_str2"'X*)[^X\/]/\1X/; ta' "$newfile"`. But they are not working for multi-line cases, like cases 6 and 7. Also, I would really like to combine these two sed commands into one, if possible. – Puneet Jain Aug 26 '16 at 03:32
  • HELLOOOOO!! u there? @John1024 – Puneet Jain Aug 29 '16 at 00:12