
I am using the requests_html library to scrape a website, but the text I grab also contains the ad scripts (AdSense and similar) from that site. The example looks something like this:

some text some text some text some text and then this: (adsbygoogle = window.adsbygoogle || []).push({});

some text some text some text after a line break and then this:

sas.cmd.push(function() {
    sas.call("std", {
        siteId: 301357, //
        pageId: 1101926, // Page : Seneweb_AF/rg
        formatId: 49048, // Format : Pave 2 300x250
        target: '' // Ciblage
    });
});

Now how can I get rid of the ad-script text shown in the examples above?

moctarjallo

2 Answers


If requests_html doesn't have a built-in mechanism for handling this, then a solution is to use pure Python; this is what I found so far:

curated_article = article.text.split('\n')  # split the scraped text into individual lines
# keep only the lines that do not start with "&#" and join the rest back together
curated_article = "\n".join(list(filter(lambda a: not a.startswith("&#"), curated_article)))
print(curated_article)

where article is the HTML element for a scraped article
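
For context, a minimal sketch of how article might be obtained with requests_html; the URL and the 'article' CSS selector are just assumptions for illustration:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.com/some-article')   # hypothetical URL
article = r.html.find('article', first=True)          # assumed selector for the article element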

moctarjallo

Assuming you are able to get hold of the text as a string before you need to remove the unwanted parts, you can search and replace.

If (adsbygoogle = window.adsbygoogle || []).push({}); is always the exact same string (including the same whitespace every time), then you can use str.replace(). See How to use string.replace() in python 3.x.
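
For example, a minimal sketch (the surrounding text here is made up for illustration):

ad_snippet = '(adsbygoogle = window.adsbygoogle || []).push({});'
original_text = 'some text some text\n' + ad_snippet + '\nmore text'  # made-up input
sanitized_text = original_text.replace(ad_snippet, '')   # remove every exact occurrence
print(sanitized_text)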

If the text is not the exact same thing every time (and I am guessing that at least the second example you showed is not), then you can use regular expressions. See the Python documentation of the re module. If you only use a few regular expressions in your program you can just call re.sub, something like this:

sanitized_text = re.sub(regularexpression, '', original_text, flags=re.MULTILINE|re.DOTALL)

It may take some trial and error to get a pattern that matches every case like the second example.

You'll need re.MULTILINE if there are newlines inside the retrieved article, as there almost certainly will be, and re.DOTALL in order to make certain regex patterns work across line boundaries, which it appears the second example will require.
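
As a rough sketch, a pattern like the following could remove the second kind of snippet; the exact pattern is an assumption and will probably need adjusting against the real pages:

import re

# Non-greedy match from "sas.cmd.push(" up to the two closing "});" tokens.
# re.DOTALL lets "." match newlines, so the pattern still matches when the ad
# script is spread over several lines in the scraped text.
sanitized_text = re.sub(r"sas\.cmd\.push\(.*?\}\);\s*\}\);", '', original_text, flags=re.MULTILINE|re.DOTALL)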

If you end up having to use several regular expressions you can compile them using re.compile before you start scraping:

pattern = re.compile(regularexpression, flags=re.MULTILINE|re.DOTALL)

Later, when you have text to remove pieces from, you can do the search and replace like this:

sanitized_text = pattern.sub('', original_text)
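
Putting it together, a rough sketch; the list of patterns is an assumption and you would extend it as you find more ad snippets in the scraped text:

import re

# Compile every ad pattern once, up front.
AD_PATTERNS = [
    re.compile(r"\(adsbygoogle = window\.adsbygoogle \|\| \[\]\)\.push\(\{\}\);"),
    re.compile(r"sas\.cmd\.push\(.*?\}\);\s*\}\);", flags=re.MULTILINE|re.DOTALL),
]

def sanitize(text):
    # Apply each compiled pattern in turn, removing whatever it matches.
    for pattern in AD_PATTERNS:
        text = pattern.sub('', text)
    return text

sanitized_text = sanitize(original_text)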
David K