
Hello, I need to remove a div when the string contains: <!--googleoff: index-->

So I have this HTML:

<span>TEXT</span><div><!--googleoff: index--> some other text</div><p>Some string</p>

And I need the output to look like this:

<span>TEXT</span><p>Some string</p>

I have been trying to find out how to do this in bs4 but can't find a solution.

EDIT Full string:

<div style="font-size: 18px"><p><span style="font-size:18px;"><strong>Drivstofftankmonteringsdeler - Toyota Rav 4 2000-2006</strong></span></p></div><div style="font-size: 18px"> </div><div style="font-size: 18px"><!--googleoff: index-->En drivstofftank er en viktig del av bilen. Hvilken som helst motor er avhengig av drivstoffsystem med korrekt funksjon og bare den beste kvaliteten garanterer sikker kjøring. Det er derfor ikke verdt å prøve å spare på drivstofftanken eller drivstoffsystemet. Velg NOMAX.NO for å vćre sikker på at du får best mulig kvalitet.<br /><br />Lurer du på om den valgte drivstofftanken er riktig for bilen din? Ta kontakt med oss på telefon eller send en e-post. Våre eksperter svarer gjerne på alle dine spørsmål og vil gjerne hjelpe deg med å velge de riktige delene som passer til bilen din.<br /> </div><p><span style="font-size:18px;">- 2stk</span></p><p><span style="font-size:18px;">- høy kvalitet</span></p><p><span style="font-size:18px;">- bredde 12mm</span></p>

CODE:

import re
regex = r'<div style="font-size: 18px">.*?<!--googleoff: index-->.*?</div>'
input = '<div style="font-size: 18px"><p><span style="font-size:18px;"><strong>Drivstofftankmonteringsdeler - Toyota Rav 4 2000-2006</strong></span></p></div><div style="font-size: 18px"> </div><div style="font-size: 18px"><!--googleoff: index-->En drivstofftank er en viktig del av bilen. Hvilken som helst motor er avhengig av drivstoffsystem med korrekt funksjon og bare den beste kvaliteten garanterer sikker kjøring. Det er derfor ikke verdt å prøve å spare på drivstofftanken eller drivstoffsystemet. Velg NOMAX.NO for å vćre sikker på at du får best mulig kvalitet.<br /><br />Lurer du på om den valgte drivstofftanken er riktig for bilen din? Ta kontakt med oss på telefon eller send en e-post. Våre eksperter svarer gjerne på alle dine spørsmål og vil gjerne hjelpe deg med å velge de riktige delene som passer til bilen din.<br /> </div><p><span style="font-size:18px;">- 2stk</span></p><p><span style="font-size:18px;">- høy kvalitet</span></p><p><span style="font-size:18px;">- bredde 12mm</span></p>'
output = re.sub(regex, "", input)
print(output)
Small Atom
  • what do you mean by remove div. Do you want to extract span tags followed by p tags? – abhilb Feb 04 '20 at 12:38
  • Would decompose work as [described here](https://stackoverflow.com/questions/32063985/deleting-a-div-with-a-particlular-class-using-beautifulsoup) or [here](https://kaijento.github.io/2017/03/30/beautifulsoup-removing-tags/)? – DarrylG Feb 04 '20 at 12:43
  • I need to remove the container (in the example, the div) when it has <!--googleoff: index--> inside – Small Atom Feb 04 '20 at 12:50

1 Answer

You could use a regular expression for this. An online regex tester, like this one, can also help, because regular expressions can be fickle: they behave differently between languages and libraries, and they have flags (case insensitivity, Unicode support, ...).

The one I came up with for your problem is the following:

<div><!--googleoff: index-->.*?</div>

What does the mumbo jumbo .*? mean?

  • . means 'matches any character'
  • * means 'match the preceding thing any number of times [including zero]'
  • ? means 'make the previous matcher non-greedy'

I am no regex (the common abbreviation for regular expression) god, but the last one is not universal. Non-greedy quantifiers come from Perl and are supported by Python, PCRE, Java, JavaScript and .NET, but POSIX-style engines such as those in plain grep and sed do not support them. So what do they mean together?

  • .* means 'match any character any number of times' (basically, anything goes)
  • .*? means 'match any character any number of times but prefer shorter'

By default, regex quantifiers are greedy in most engines. Why do we want non-greedy? Because of repeats. Imagine we have the input:

<span>TEXT</span><div><!--googleoff: index--> some other text</div><p>Some string</p><div><!--googleoff: index--> some more text</div>

The greedy approach (without ?) would lead to the following incorrect output:

<span>TEXT</span>

instead of

<span>TEXT</span><p>Some string</p>

So, how do you execute this in Python? Like so:

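import re

# Non-greedy pattern: stop at the first </div> after the marker comment
regex = r"<div><!--googleoff: index-->.*?</div>"
# Named html rather than input, so the built-in input() is not shadowed
html = "<span>TEXT</span><div><!--googleoff: index--> some other text</div><p>Some string</p>"
output = re.sub(regex, "", html)
print(output)  # <span>TEXT</span><p>Some string</p>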
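
To check the greedy versus non-greedy outputs described earlier, here is a minimal sketch that runs both patterns on the input with two marked divs. The variable names (`html`, `greedy`, `lazy`) are only illustrative.

```python
import re

# Input with two marked divs and a <p> between them
html = (
    "<span>TEXT</span>"
    "<div><!--googleoff: index--> some other text</div>"
    "<p>Some string</p>"
    "<div><!--googleoff: index--> some more text</div>"
)

# Greedy: .* runs to the LAST </div>, so the <p> in between is swallowed too
greedy = re.sub(r"<div><!--googleoff: index-->.*</div>", "", html)

# Non-greedy: .*? stops at the FIRST </div>, so each div is removed on its own
lazy = re.sub(r"<div><!--googleoff: index-->.*?</div>", "", html)

print(greedy)  # <span>TEXT</span>
print(lazy)    # <span>TEXT</span><p>Some string</p>
```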

It is good practice to prefix your regexes with r (a raw string literal), as it simplifies escaping: backslashes reach the regex engine unchanged. It makes no difference in this case, because the pattern contains no backslashes, but I prefer not to run the risk.
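
For input like the full string in the question, the div carries a style attribute and other divs come before it. A pattern such as `<div style="font-size: 18px">.*?<!--googleoff: index-->.*?</div>` then removes too much: the engine starts at the leftmost div it can, and `.*?` expands across the earlier divs until it reaches the comment. A possible variant, sketched below, is to require the comment right after the opening tag. It assumes the comment is the first thing inside the div (apart from whitespace) and that the div has no nested div; the input is a shortened stand-in for the real string.

```python
import re

# [^>]* allows any attributes on the opening tag and cannot cross into another tag;
# \s* allows whitespace before the comment;
# re.DOTALL lets . match newlines in case the div spans several lines.
pattern = re.compile(r"<div[^>]*>\s*<!--googleoff: index-->.*?</div>", re.DOTALL)

# Shortened stand-in for the full string in the question
html = (
    '<div style="font-size: 18px"><p>Title</p></div>'
    '<div style="font-size: 18px"> </div>'
    '<div style="font-size: 18px"><!--googleoff: index-->Long text<br /> </div>'
    '<p>- 2stk</p>'
)

cleaned = pattern.sub("", html)
print(cleaned)
```

Only the third div is removed here; the two divs before it stay, because neither has the comment directly after its opening tag.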

Note that in this answer I am ignoring that this is HTML, i.e. structured text, and that there are ways of doing this that actually parse it and let you traverse the tree of elements. That can be a good approach too, but for a quick script it can be overkill and have unintended consequences (does it round-trip to the same source, bar removing this div? I wouldn't put my hand in the fire for that). Ignoring the structure also means there are limitations: if there is another div tag within the div being removed, the regex will not work correctly, because the non-greedy match stops at the first </div>. This cannot be fixed reliably within a regex, since matching nested tags requires a stack, so you would need a parser.
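
For completeness, here is a rough sketch of the parser route using only the standard library's `html.parser`. The names `DivStripper`, `strip_flagged_divs` and `MARKER` are made up for this example. It keeps a stack of output buffers, one per open div, which is the stack a regex lacks. It is a sketch, not a guaranteed byte-for-byte round trip: end tags are re-emitted in lowercase, and doctype declarations and processing instructions are not re-emitted.

```python
from html.parser import HTMLParser

MARKER = "googleoff: index"  # comment text that flags a div for removal


class DivStripper(HTMLParser):
    """Re-emit HTML, dropping every <div> that directly contains the marker comment."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        # Each frame is [chunks, drop_flag]; frame 0 is the document itself,
        # and every open <div> pushes one more frame.
        self.frames = [[[], False]]

    def _emit(self, text):
        self.frames[-1][0].append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.frames.append([[], False])
        self._emit(self.get_starttag_text())  # original tag text, attributes included

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())  # e.g. <br />

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>")
        if tag == "div" and len(self.frames) > 1:
            chunks, drop = self.frames.pop()
            if not drop:  # keep unflagged divs by copying them into the parent
                self._emit("".join(chunks))

    def handle_comment(self, data):
        if MARKER in data and len(self.frames) > 1:
            self.frames[-1][1] = True  # flag the innermost open div
        self._emit(f"<!--{data}-->")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def result(self):
        # Divs that were never closed are kept as-is
        return "".join("".join(chunks) for chunks, _ in self.frames)


def strip_flagged_divs(html):
    parser = DivStripper()
    parser.feed(html)
    parser.close()
    return parser.result()


print(strip_flagged_divs(
    "<span>TEXT</span><div><!--googleoff: index--> some other text</div><p>Some string</p>"
))  # <span>TEXT</span><p>Some string</p>
```

Unlike the regex, this also removes a flagged div that contains another div, since the inner div's text is buffered inside the outer frame and discarded along with it.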

Cryvate
  • Thanks, but as you said, it's a problem when I have other divs in the code. I have edited the post, please look at the full string – Small Atom Feb 04 '20 at 13:20
  • The code/regex above should still work. There are allowed to be other `div`s in the string, just not within the same `div` that you want to remove: `
    some
    other
    text
    ` would 'break' the code/regex. If it's just plain text, it'll work fine.
    – Cryvate Feb 04 '20 at 13:24
  • But what about when a div comes before it? – Small Atom Feb 04 '20 at 13:29
  • You mean something like: `
    some other text
    `, that would still work fine because the regex is non-greedy.
    – Cryvate Feb 04 '20 at 13:35
  • yeah now i see it shoud r'
    .*?
    ' thx!!
    – Small Atom Feb 04 '20 at 13:37
  • I've updated the answer. If you're just scraping a few webpages and need to clean the HTML, it will do just fine. If you want to use it in 'production code', e.g. something that runs automated, where the input might vary, or that might be used by third parties, I would recommend looking into a [proper parser](https://docs.python.org/3.8/library/html.parser.html). On the other hand, regexes are everywhere, so it's good to add them to your toolbox, and by fixing my regex you've taken your first step! – Cryvate Feb 04 '20 at 13:41
  • I hope there will be no exceptions to this rule, because there are many such descriptions. I need to study regex hard. Thanks for the help – Small Atom Feb 04 '20 at 13:52
  • One more question: sometimes this div has a different style, like:
    I tried this:
    .*?
    but it removes the div before it
    – Small Atom Feb 04 '20 at 14:19