I have looked through Stack Overflow but cannot find an answer to my problem. How do I get specific text that comes after a <br> tag?

This is my code:

product_review_container = container.findAll("span",{"class":"search_review_summary"})
for product_review in product_review_container:
    prr = product_review.get('data-tooltip-html')
    print(prr)

This is the output:

Very Positive<br>86% of the 1,013 user reviews for this game are positive.

From this string I want only the 86% and, separately, only the 1,013, i.e. just the numbers. However, the value is a string, not an int, so I do not know how to extract them.

Here is where the text comes from:

   [<span class="search_review_summary positive" data-tooltip-html="Very Positive&lt;br&gt;86% of the 1,013 user reviews for this game are positive.">
</span>]

Here is the link from where I am getting the information: https://store.steampowered.com/search/?specials=1&page=1

Thank you!

Sofelia

2 Answers

You need to use regex here!

import re

string = 'Very Positive<br>86% of the 1,013 user reviews for this game are positive.'
a = re.findall(r'(\d+%)|(\d+,\d+)', string)  # raw string so \d is not treated as a string escape
print(a)

output: [('86%', ''), ('', '1,013')]
#Then a[0][0] will be 86% and a[1][1] will be 1,013

Here \d matches any digit character, and + means one or more of the preceding item, so \d+ matches a run of at least one digit.

If you need a more specific regex, you can try it out at https://regex101.com
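
If the goal is to end up with plain integers, one possible variation is to use named groups and strip the thousands separator before converting. This is only a sketch: the function name parse_review_summary and the group names percent and count are made up for illustration, and the pattern assumes the tooltip always follows the "N% of the M user reviews" wording shown in the question.

```python
import re

tooltip = "Very Positive<br>86% of the 1,013 user reviews for this game are positive."

# Assumed wording: "<percent>% of the <count> user reviews".
# The count may or may not contain thousands separators.
pattern = re.compile(
    r"(?P<percent>\d+)%\s+of\s+the\s+(?P<count>\d[\d,]*)\s+user reviews"
)


def parse_review_summary(text):
    """Return (percent, count) as ints, or None if the text does not match."""
    match = pattern.search(text)
    if match is None:
        return None
    percent = int(match.group("percent"))
    # Drop the commas so "1,013" can be converted with int().
    count = int(match.group("count").replace(",", ""))
    return percent, count


print(parse_review_summary(tooltip))  # (86, 1013)
```

Returning None for text that does not match lets the caller skip entries without a review summary instead of hitting an exception.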

  • Thanks! This worked. I had a little problem with the , in between 1,013. Had to make it into a . (dot) and then do replace(",","") but it works now :) – Sofelia Mar 04 '19 at 00:11

There's a non-regex way to do it; admittedly somewhat convoluted, but still fun:

First, we borrow (and modify) this nice function:

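def split_and_keep(s, sep):
    if not s:
        return ['']  # consistent with string.split()
    # pick a marker character that cannot occur in s
    p = chr(ord(max(s)) + 1)
    # append the marker after each separator, then split on the marker
    return s.replace(sep, sep + p).split(p)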

Then we go through some standard steps:

html = """
  [<span class="search_review_summary positive" data-tooltip-html="Very Positive&lt;br&gt;86% of the 1,013 user reviews for this game are positive."></span>]
  """

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
info = soup.select('span')[0].get("data-tooltip-html")
print(info)

Output so far:

Very Positive<br>86% of the 1,013 user reviews for this game are positive.

Next, we keep only the digits and the percent sign:

data = ''.join(c for c in info if (c.isdigit()) or c == '%')
print(data)

Output is a little better now:

86%1013

Almost there; now the pièce de résistance:

split_and_keep(data, '%')

Final output:

['86%', '1013']
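
If the text before and after the <br> is wanted separately, a simpler sketch is to split on the tag itself with str.partition and then pick out the tokens that start with a digit. This assumes info holds the decoded attribute value printed above; the names label, detail and numbers are only illustrative.

```python
info = "Very Positive<br>86% of the 1,013 user reviews for this game are positive."

# partition() splits at the first "<br>" only; detail is empty if the tag is missing.
label, _, detail = info.partition("<br>")

# Keep the whitespace-separated tokens that start with a digit.
numbers = [token for token in detail.split() if token[0].isdigit()]

print(label)    # Very Positive
print(numbers)  # ['86%', '1,013']
```

Unlike the character filter above, this keeps the comma in 1,013, so a replace(",", "") is still needed before converting to int.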
Jack Fleeting