I am reading a page and trying to extract some data from it. I want to do this in bash, and after going through a few links I learned that 'Shell Parameter Expansion' might help. However, I am having difficulty using it in my script. I know that sed might be easier, but for my own knowledge I want to know how to achieve this in bash.

shopt -s extglob

str='My work</u><br /><span style="color: rgb(34,34,34);"></span><span>abc-X7-27ABC | </span><span style="color: rgb(34,34,34);">build'
echo "${str//<.*>/|}"

I want my output to be like this: My work|abc-X7-27ABC |build

I checked whether it accepts only literal words rather than patterns, and it does seem to work with literal words.

For instance,
echo "${str//span style/|}" works but
echo "${str//span.*style/|}" doesn't

On the other hand, I saw in one of the links that it does accept patterns. I am confused about why it is not working with the pattern I am using above.

How to make sed do non-greedy match? (User konsolebox's solution)

Technext
  • Are you guaranteed to have a `|` character in the `abc ...` part? – devnull Nov 27 '13 at 11:49
  • The | character will not be there initially. You can ignore it. All I want is to remove everything _before_ and _after_ the following words: _My work_, _abc-X7-27ABC_ and _build_. While removing the HTML syntax, I would like to have | as the replacement character. The output should, therefore, look like `My work|abc-X7-27ABC |build` – Technext Nov 27 '13 at 12:02

2 Answers

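One mistake you're making is mixing shell globbing and regex. In a shell glob, a dot is taken literally as a dot character, not as "any single character" as in a regex, and `*` on its own matches any string of zero or more characters. So the regex `.*` corresponds to the glob `*`.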

If you try this code instead:

echo "${str//<*>/|}"

then it will print:

My work|build
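
As a minimal sketch of the difference, the following uses a short made-up sample string (the variable names `s`, `with_dot` and `greedy` are only for illustration). It assumes bash and shows that the pattern with a dot matches nothing here, while `*` alone matches greedily from the first `<` to the last `>`:

```shell
#!/usr/bin/env bash

s='a<b>c<d>e'   # hypothetical sample, not the string from the question

# In a glob, '.' is a literal dot, so '<.*>' only matches text containing "<.".
# The sample has no "<.", so nothing is replaced.
with_dot="${s//<.*>/|}"

# '*' alone matches any string, and the substitution takes the longest match,
# so everything from the first '<' to the last '>' becomes a single '|'.
greedy="${s//<*>/|}"

printf '%s\n' "$with_dot"   # a<b>c<d>e
printf '%s\n' "$greedy"     # a|e
```

The second result is the same greedy behaviour that turns the question's string into `My work|build`.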
anubhava
  • Thanks @anubhava for pointing it out. Now the issue is that I am not getting the result I actually want. It's doing a greedy match. Is there any way I can use something like '${parameter%word}' to get the output I want in bash: `My work|abc-X7-27ABC |build`? – Technext Nov 27 '13 at 11:07
  • @Technext: I am on mobile at present. When I get back to my computer I will play with your input text to see if I can get the desired output. – anubhava Nov 27 '13 at 11:09
  • The problem here is that you don't want a greedy match (which eats up everything from the `</u>` tag to the opening `<span>` tag just before "build"), but you also don't want a non-greedy match (which would result in too many substitutions: "My work|||||abc-X7-27ABC | ||build"). The same advice that goes for regular expressions and HTML parsing ("Don't. Use a proper HTML parser.") applies even more so for shell patterns, which are less powerful and primarily intended for matching simple sets of file names. – chepner Nov 27 '13 at 13:37
  • @Technext: See the detailed comment from `Chepner` above. It is not a good idea to parse HTML like this using shell utilities. Using a DOM parser, such as the one in PHP, would be the ideal way. – anubhava Nov 27 '13 at 13:54
  • Thanks @chepner & anubhava for your comments. I also got stuck with the following output: "My work|||||abc-X7-27ABC | ||build". Can the output be achieved with sed or awk or Perl? – Technext Nov 28 '13 at 06:15
  • @Technext: If you got above string you can pipe it to another sed like: `echo "My work|||||abc-X7-27ABC | ||build" | sed 's/ *| *| *//g'` to get `My work|abc-X7-27ABC|build` – anubhava Nov 28 '13 at 06:38
  • You're welcome. If you think my answer helped you anyway please consider marking it as accepted. – anubhava Nov 28 '13 at 11:20
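
The comments above ask whether something like `${parameter%word}` can get closer to the desired output. One possible sketch, not taken from either answer, is a loop that peels off one tag at a time using only the prefix and suffix removal expansions. The function name `strip_tags` is hypothetical, and the approach assumes that no `>` appears inside an attribute value:

```shell
#!/usr/bin/env bash

# Hypothetical helper: join the text found between tags with '|'.
strip_tags() {
    local s=$1 out='' text
    # Keep going while a '<' followed later by a '>' remains.
    while [[ $s == *'<'*'>'* ]]; do
        text="${s%%<*}"   # longest suffix from the first '<' removed: text before the tag
        s="${s#*<}"       # shortest prefix removed: drop everything through the first '<'
        s="${s#*>}"       # shortest prefix removed: drop the tag body through the next '>'
        # Skip empty pieces so adjacent tags yield a single separator.
        [[ -n $text ]] && out+="${out:+|}$text"
    done
    [[ -n $s ]] && out+="${out:+|}$s"
    printf '%s\n' "$out"
}

str='My work</u><br /><span style="color: rgb(34,34,34);"></span><span>abc-X7-27ABC | </span><span style="color: rgb(34,34,34);">build'
strip_tags "$str"   # My work|abc-X7-27ABC | |build
```

The extra `| ` in that output is the literal pipe already present in the sample string. With an input that lacks it, as described in the comments on the question, the result is `My work|abc-X7-27ABC |build`. This remains pattern matching rather than HTML parsing, so the caveats about using a proper parser still apply.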
This is not an answer, so much as a demonstration of why pattern-matching is not recommended for this kind of HTML editing. I attempted the following.

shopt -s extglob
set +H    # Turn off history expansion, if necessary, to allow the !(...) pattern
echo ${str//+(<+(!(>))>)/|}

First: it didn't work, even for a simpler string like str='My work</u><br />bob<foo>build'. Second, for the string in the original question, it appeared to lock up the shell; I suspect such a complex pattern triggers exponential backtracking.

Here's how it's intended to work:

  1. !(>) is anything other than the single character > (so it can also match an empty or multi-character string)
  2. +(!(>)) is one or more non-> characters.
  3. <+(!(>))> is one or more non-> characters enclosed in < and >
  4. +(<+(!(>))>) is one or more groups of <...>-enclosed non->s.

My theory is that since !(>) can match a multi-character string as well as a single character, there is a ton of backtracking required.
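
For comparison, here is a hedged sketch of the same idea with a bracket expression in place of `!(>)`, so that each inner element can match only one non-`>` character. This is a swapped-in pattern, not the one from the answer. The variable names `tag` and `tags` are illustrative, and it is tried only on the simpler string mentioned above; how it scales to long inputs is not examined here.

```shell
#!/usr/bin/env bash
shopt -s extglob   # enable *(...) and +(...) patterns

str='My work</u><br />bob<foo>build'   # the simpler string from the answer

# Patterns are kept in variables and expanded unquoted inside ${...//...}.
tag='<*([^>])>'    # one tag: '<', zero or more characters other than '>', then '>'
tags="+($tag)"     # one or more consecutive tags

per_tag="${str//$tag/|}"    # one '|' for every tag
per_run="${str//$tags/|}"   # one '|' for every run of adjacent tags

printf '%s\n' "$per_tag"    # My work||bob|build
printf '%s\n' "$per_run"    # My work|bob|build
```

Because `[^>]` consumes exactly one character, a match can never cross a `>`, which is the property `!(>)` fails to guarantee.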

chepner