Assuming you are able to get hold of the text as a string before you need to remove the unwanted parts, you can search and replace.
If (adsbygoogle = window.adsbygoogle || []).push({}); is always the exact same string (including the same whitespace every time), then you can use str.replace().
See How to use string.replace() in python 3.x.
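As a minimal sketch of the exact-match case, assuming the unwanted snippet never varies (the sample text and the AD_SNIPPET name here are made up for illustration):

```python
# Exact-match removal with str.replace(); the sample text is invented.
AD_SNIPPET = "(adsbygoogle = window.adsbygoogle || []).push({});"

original_text = "First paragraph.\n" + AD_SNIPPET + "\nSecond paragraph."

# str.replace() returns a new string with every occurrence replaced.
sanitized_text = original_text.replace(AD_SNIPPET, "")

print(repr(sanitized_text))
```

Note that this leaves the surrounding newlines in place, so you may want to collapse blank lines afterwards.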
If the text is not exactly the same every time (and I am guessing that at least the second example you showed is not), then you can use regular expressions. See the Python documentation of the re module.
If you only use a few regular expressions in your program, you can just call re.sub, something like this:
sanitized_text = re.sub(regularexpression, '', original_text, flags=re.MULTILINE|re.DOTALL)
It may take some trial and error to get the pattern to match every case that is like the second example.
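You'll need re.DOTALL so that . also matches newline characters, which lets a pattern run across line boundaries, as it appears the second example will require. re.MULTILINE only changes how ^ and $ behave (they then match at the start and end of each line rather than of the whole string), so it matters only if your pattern uses those anchors; the retrieved article will almost certainly contain newlines, so including it is harmless.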
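To illustrate the effect of re.DOTALL, here is a small sketch. The BEGIN-AD and END-AD markers are hypothetical stand-ins for whatever delimits the unwanted block in your text, not something taken from the question:

```python
import re

# Hypothetical pattern: everything from "BEGIN-AD" to the nearest "END-AD",
# plus one trailing newline if present. The non-greedy .*? stops at the
# first END-AD instead of swallowing everything up to the last one.
regularexpression = r"BEGIN-AD.*?END-AD\n?"

original_text = "Intro line.\nBEGIN-AD\nbuy now\nclick here\nEND-AD\nClosing line."

# With re.DOTALL, "." matches newlines, so the multi-line block is removed.
sanitized_text = re.sub(regularexpression, "", original_text, flags=re.DOTALL)

# Without the flag, "." stops at the first newline and nothing matches.
without_flag = re.sub(regularexpression, "", original_text)

print(repr(sanitized_text))
print(without_flag == original_text)
```

re.MULTILINE is left out here only because this particular pattern has no ^ or $ anchors.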
If you end up having to use several regular expressions, you can compile them using re.compile before you start scraping:
pattern = re.compile(regularexpression, flags=re.MULTILINE|re.DOTALL)
Later, when you have text to remove pieces from, you can do the search and replace like this:
sanitized_text = pattern.sub('', original_text)
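Putting the pieces together, one possible arrangement is to compile all the patterns once and apply them in a loop. The patterns, the PATTERNS list, and the sanitize helper below are illustrative assumptions; you would substitute expressions that match your own unwanted text:

```python
import re

# Hypothetical patterns, compiled once before scraping starts.
PATTERNS = [
    # The adsbygoogle call, tolerating variable whitespace around = and ||.
    re.compile(r"\(adsbygoogle\s*=\s*window\.adsbygoogle\s*\|\|\s*\[\]\)\.push\(\{\}\);"),
    # A made-up multi-line block between BEGIN-AD and END-AD markers.
    re.compile(r"BEGIN-AD.*?END-AD\n?", flags=re.DOTALL),
]


def sanitize(text):
    """Apply every compiled pattern in turn, deleting whatever it matches."""
    for pattern in PATTERNS:
        text = pattern.sub("", text)
    return text


sample = (
    "Title\n"
    "(adsbygoogle  =  window.adsbygoogle||[]).push({});\n"
    "BEGIN-AD\nspam\nEND-AD\n"
    "Body"
)
print(repr(sanitize(sample)))
```

Keeping the patterns in a list means adding another case later is a one-line change.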