Bash: Content between two complex Patterns - html

Question

I have tried several times to extract the text between two HTML patterns. Neither sed nor awk worked for me, since the examples on the internet were too simple to fit my task.

Here is the code I want to filter:

....class="a-size-base review-text">I WANT THIS TEXT</span></div> ....

So I need a command that outputs I WANT THIS TEXT, i.e. whatever sits between ...review-text"> and </span>.

Do you have a clue? Thanks for the effort and greetings from Germany.

Here is the plain code

Please share what you have tried, and explain why they didn't work. — BenM, Nov 03 '17 at 13:36
[You shouldn't use regex for (X)HTML parsing](https://stackoverflow.com/a/1732454/404556), but a real xml parser like `xmllint`. If you provide more details of your html structure, we might help you to write the xpath query. — randomir, Nov 03 '17 at 13:52
Hi randomir: Here is the html file code: ibb.co/iNSDXb -- Thanks for the effort! — MultiF 95, Nov 03 '17 at 14:21
That's an image you posted there! Please see how to write [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). — randomir, Nov 03 '17 at 14:28
Here is my own solution: cat source.html | tr -d '>"' | grep -o 'review-text[^<>]*spandivdiv' | awk -F 'review-text' '{ print $2 }' | awk -F 'spandivdiv' '{ print $1 }' — MultiF 95, Nov 07 '17 at 12:17

hi olaf · Answer 1 · 2017-11-03T14:39:14.087

0

I can't see the problem here, assuming the text you want to extract contains neither < nor >. For instance, with a POSIX regular expression:

$ HTML_FILE=/tmp/myfile.html
$ sed -n "s/.*review-text.>\([^<]*\)<.*/\1/gp" "$HTML_FILE"

This prints the text between the HTML tags.
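To make the idea easy to try without the original page, here is a minimal sketch that runs the same kind of substitution on a made-up one-line sample. The `sample` string and the function name `extract_review_sed` are assumptions for illustration; the real page markup may differ. Note that sed works line by line and `.*` is greedy, so this prints at most one match per line (the last one).

```shell
# Hypothetical sample line; the real review page may be laid out differently.
sample='<div><span class="a-size-base review-text">I WANT THIS TEXT</span></div>'

# Read HTML on stdin and print the text that follows review-text">
# up to the next "<". Lines without a match print nothing (-n plus p).
extract_review_sed() {
  sed -n 's/.*review-text">\([^<]*\)<.*/\1/p'
}

printf '%s\n' "$sample" | extract_review_sed
```

This uses only basic regular expressions, so it should not need `-r` or `-E`.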

edited Nov 03 '17 at 14:39

answered Nov 03 '17 at 13:46


It is the HTML source of a product review link. All I want is the review text itself, and as I said, it sits between those two patterns. Filtering the whole HTML file with your command did not work for me... – MultiF 95 Nov 03 '17 at 14:03
I updated my answer to extract all the text from a file instead of a single variable. – hi olaf Nov 03 '17 at 14:20
I needed to use sed -E instead of -r because I'm on OS X. Anyway, it did not work as intended: sed gives me a huge list instead of printing only the text between the two patterns. Thanks for the effort, though. – MultiF 95 Nov 03 '17 at 14:25
@MultiF 95 you can use this simple regular expression command. It worked on the file supplied in your link: sed -n "s/.*review-text.>\([^<]*\)<.*/\1/gp" $HTML_FILE – hi olaf Nov 03 '17 at 14:36

score 0 · Answer 2 · answered Nov 03 '17 at 13:54

Try:

tr '\n' ' ' < file.html | grep -o 'review-text">[^<>]*</span> *</div>' | cut -d'>' -f2 | cut -d'<' -f1

It should work as long as there are no tags inside "I WANT THIS TEXT".
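As a rough sketch of the same pipeline idea, the snippet below joins the input into one line and lets `grep -o` print every match on its own line, so several reviews on the same line are all returned. The `html` sample and the name `extract_reviews_grep` are made up for illustration. `tr` reads only from standard input, so the data has to be piped or redirected into it.

```shell
# Hypothetical sample with two reviews on a single line.
html='<span class="a-size-base review-text">first review</span></div><span class="a-size-base review-text">second review</span></div>'

# Read HTML on stdin, print one review text per output line.
extract_reviews_grep() {
  tr '\n' ' ' |                                  # join all lines into one
    grep -o 'review-text">[^<>]*</span>' |       # one match per output line
    sed -e 's/^review-text">//' -e 's#</span>$##' # strip the two delimiters
}

printf '%s\n' "$html" | extract_reviews_grep
```

For a real file, the call would look like `extract_reviews_grep < file.html`. Like the original, this assumes there are no tags inside the review text.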

answered Nov 03 '17 at 13:54

mefju


Terminal says that I am not using tr correctly. I'm on a Mac. Is that the problem? – MultiF 95 Nov 03 '17 at 14:16
On my Linux machine the tr command works as expected. I can't test it on a Mac right now. You can try any other tool (sed, awk). First, change every end-of-line character to a space, then pipe the output to grep with the -o option. It should work. – mefju Nov 03 '17 at 14:30
