I have a long XML file (about 60K lines). I need a bash script that prompts the user for a name and then removes that name's entry from the XML file. I was thinking of sed, but I'm open to a better option. Here is what I have so far:

echo -n "Type media to remove and press [ENTER]: "
read -r TARGET

while true; do
    read -r -p "Are you sure you wish to remove $TARGET from the system? [Y/N] " yn
    case $yn in
        [Yy]* ) SED COMMAND HERE; break;;
        [Nn]* ) echo "Cancelling..."; exit;;
        * ) echo "---please answer [Y] or [N]";;
    esac
done

And here is a section of the XML file. Note that this section repeats through the file hundreds of times. The only difference between the blocks is the part I have labelled "corrupt" for this example.

<media>
  <name>"corrupt"</name>
  <parent>system</parent>
  <location>/path/to/the/"corrupt".zip</location>
  <video>/another/path/"corrupt".flv</video>
  <images>
    <image>
      <type>saved</type>
      <image-file>/yet/another/path/"corrupt".png</image-file>
    </image>
  </images>
</media>

In this example, I want to remove "corrupt" from the XML file. It is important to note that there is only one instance of "corrupt" in the file. Also, for other "corrupt_files", there are no spaces in the file names, only underscores or dashes.

So sed would need to remove the entire xml block containing "corrupt" information, leaving no empty lines where it removed text, then the script would overwrite the current "media.xml" file.

I hope this question isn't confusing.

  • `sed` is really the [wrong tool for any kind of structured input such as XML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)... – jub0bs Mar 14 '15 at 17:28
  • You write 'So sed would need to remove the entire xml block containing "corrupt" information'; what do you mean by "the entire xml block"? Do you mean just the `...` and `` elements, or do you mean the entire `...` element? As @Jubobs points out, either way, `sed` is probably the wrong tool for the job, but which one you are referring to will affect the answer. – Brian Campbell Mar 14 '15 at 19:40
  • @Brian ...I'm sorry. By block, I mean the entire example of XML I posted. That is what I am referring to as a block. So everything in and including `<media>...</media>` – GHO5TLIKE5WAYZE Mar 14 '15 at 20:54
  • @jubobs I don't have to use `sed` , that's just what I thought would work – GHO5TLIKE5WAYZE Mar 14 '15 at 20:55

1 Answer

You should use a proper XML tool, but this GNU awk command removes the block whose name contains "corrupt":

cat file
<media>
  <name>"test1"</name>
  <parent>system</parent>
  <location>/path/to/the/"test1".zip</location>
  <video>/another/path/"test1".flv</video>
  <images>
    <image>
      <type>saved</type>
      <image-file>/yet/another/path/"test1".png</image-file>
    </image>
  </images>
</media>
<media>
  <name>"corrupt"</name>
  <parent>system</parent>
  <location>/path/to/the/"corrupt".zip</location>
  <video>/another/path/"corrupt".flv</video>
  <images>
    <image>
      <type>saved</type>
      <image-file>/yet/another/path/"corrupt".png</image-file>
    </image>
  </images>
</media>
<media>
  <name>"test2"</name>
  <parent>system</parent>
  <location>/path/to/the/"test2".zip</location>
  <video>/another/path/"test2".flv</video>
  <images>
    <image>
      <type>saved</type>
      <image-file>/yet/another/path/"test2".png</image-file>
    </image>
  </images>
</media>

awk -v RS="<media>" '!/<name>"corrupt/ && NR>1 {print RS$0}' file
<media>
  <name>"test1"</name>
  <parent>system</parent>
  <location>/path/to/the/"test1".zip</location>
  <video>/another/path/"test1".flv</video>
  <images>
    <image>
      <type>saved</type>
      <image-file>/yet/another/path/"test1".png</image-file>
    </image>
  </images>
</media>

<media>
  <name>"test2"</name>
  <parent>system</parent>
  <location>/path/to/the/"test2".zip</location>
  <video>/another/path/"test2".flv</video>
  <images>
    <image>
      <type>saved</type>
      <image-file>/yet/another/path/"test2".png</image-file>
    </image>
  </images>
</media>
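
One possible way to wire this one-liner into the question's script is sketched below. It is only a sketch under stated assumptions: the function name `remove_media_rs` is made up for illustration, the awk in use must accept a multi-character `RS` (GNU awk does), and names in the file are wrapped in literal double quotes as in the sample. Using `printf` instead of `print` avoids the extra blank line after each block, and `index()` compares the name as a plain string rather than as a regular expression.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer): print FILE without the
# <media> block whose <name> is exactly "TARGET", quotes included.
remove_media_rs() {
    local target=$1 file=$2
    awk -v RS='<media>' -v target="$target" '
        # Record 1 is whatever precedes the first <media>; skip it, as NR>1 does.
        NR == 1 { next }
        # Skip the block whose name matches the target exactly.
        index($0, "<name>\"" target "\"</name>") { next }
        # Re-attach the separator; no ORS is added, so no blank line appears.
        { printf "%s%s", RS, $0 }
    ' "$file"
}
```

From the `[Yy]*` branch of the question's script this could be called as `remove_media_rs "$TARGET" media.xml > media.xml.new && mv media.xml.new media.xml`, where `media.xml.new` is a placeholder name. Like the one-liner, it drops anything that precedes the first `<media>`.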
Jotne
  • It seems that whenever I `cd` to the file and run `awk -v RS="<media>" '!/<name>"corrupt/ && NR>1 {print RS$0}'` nothing happens. The action starts, but I have to `ctrl+c` to get it to exit, and the .xml file to be changed is the same as before. Maybe I'm overlooking something? – GHO5TLIKE5WAYZE Mar 16 '15 at 20:25
  • haha...geesh. But still, `cat test.xml > awk -v RS="<media>" '!/<name>"corrupt/ && NR>1 {print RS$0}' test.xml > test-new.xml` just seems to add a blank line between every block. – GHO5TLIKE5WAYZE Mar 16 '15 at 21:10
  • AH! Just had to remove the symbol " that is in `'!/<name>"corrupt/`. That seems to remove the corrupt block just fine and adds a blank line between the blocks, but I can just use another command to remove the blank lines. – GHO5TLIKE5WAYZE Mar 16 '15 at 21:12
  • @GHO5TLIKE5WAYZE One drawback of setting `RS` like this is the blank line. You can remove it later with `sed` or `awk`. But a good rule is not to use `cat` with `sed`, `awk` or other programs that can read the file themselves. `awk -v RS="<media>" '!/<name>"corrupt/ && NR>1 {print RS$0}' test.xml > test-new.xml` is a good way to do it. – Jotne Mar 16 '15 at 21:19
  • But now a new issue....there are tags that fall above and below the blocks like so ` {xml blocks} ` Seems that the command strips all of these tags out along with the "corrupt" block. Leaving me only the `` blocks. I suppose I could just `sed` them back into the file before overwriting the original. – GHO5TLIKE5WAYZE Mar 16 '15 at 21:20
  • It is because the tag that `awk` searches for appears above the blocks in the `` tag. I suppose it still sees the word "media" in there and removes it. No worries...will just `sed` them back into place. – GHO5TLIKE5WAYZE Mar 16 '15 at 21:27
  • @GHO5TLIKE5WAYZE If you post more of your code, so I can see what goes wrong, I may be able to help you. It may be possible to narrow down the `RS` so that it does not hit extra data. – Jotne Mar 17 '15 at 05:38
  • too much to post in the comments. Email perhaps? Then we could post the solution here and mark it answered. – GHO5TLIKE5WAYZE Mar 18 '15 at 06:05
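
The wrapper-tag problem raised in these comments could also be sidestepped by not touching `RS` at all. The sketch below reads the file line by line, buffers each `<media>` block, and prints every other line unchanged, so whatever sits above and below the blocks survives. It assumes that `<media>` and `</media>` each sit on their own line, as in the posted sample, and that names carry literal double quotes; the function name `remove_media_lines` is hypothetical. It is still a text-level workaround, not a substitute for a real XML tool.

```shell
#!/bin/bash
# Hypothetical helper: print FILE without the <media> block named "TARGET",
# passing all lines outside <media>...</media> through untouched.
remove_media_lines() {
    local target=$1 file=$2
    awk -v target="$target" '
        # A line holding only <media> starts a new buffered block.
        /^[ \t]*<media>[ \t]*$/ { inblock = 1; buf = ""; drop = 0 }
        inblock {
            buf = buf $0 "\n"
            # Mark the block for removal when its name matches exactly.
            if (index($0, "<name>\"" target "\"</name>")) drop = 1
            # A line holding only </media> ends the block; emit it unless marked.
            if ($0 ~ /^[ \t]*<\/media>[ \t]*$/) {
                if (!drop) printf "%s", buf
                inblock = 0
            }
            next
        }
        # Everything outside a block (declaration, wrapper tags) is kept as is.
        { print }
    ' "$file"
}
```

To overwrite the original only when the command succeeds, one option is `tmp=$(mktemp) && remove_media_lines "$TARGET" media.xml > "$tmp" && mv "$tmp" media.xml`. Because whole lines are removed together with the block, no blank lines are left behind.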