1

I have this HTML code:

<a class="button block left icon-phone" data-reveal="\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"  href="#">

This is a string, and I want to extract the content in front of data-reveal. I tried a regex like:

p = re.compile('data-reveal=*')

but it didn't work. How can I do it? Thanks.

alone

3 Answers

3

You are using the wrong tool for this. You should use an HTML parser like BeautifulSoup.

>>> from bs4 import BeautifulSoup
>>> doc = """<a class="button block left icon-phone" data-reveal="\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"  href="#">"""
>>> soup = BeautifulSoup(doc, 'html.parser')
>>> print(soup.find('a').get('data-reveal'))
۰۹۳۶۵۶۸۱۶۲۱
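
If installing BeautifulSoup is not an option, the same parser-based approach can be sketched with the standard library's html.parser module. This is only an illustration, assuming Python 3; the class name DataRevealParser is made up for this example and is not part of any library.

```python
from html.parser import HTMLParser


class DataRevealParser(HTMLParser):
    """Hypothetical helper: collects the data-reveal value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        if tag == "a":
            for name, value in attrs:
                if name == "data-reveal":
                    self.values.append(value)


doc = """<a class="button block left icon-phone" data-reveal="\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"  href="#">"""

parser = DataRevealParser()
parser.feed(doc)
print(parser.values)
```

The parser handles attribute order and quoting for you, which is the main reason to prefer it over a hand-written pattern.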
styvane
2

You shouldn't use regex for this, but I'll assume you want to, since that's what you do in the question. I'm not exactly sure what you want, so here's how to do each of the things I think you could be asking for.

Match everything inside data-reveal:
data-reveal="(.+?)"
matches: \u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1

Match EVERYTHING in front of data-reveal, up to the end of the line:
data-reveal="(.+)
matches: \u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"  href="#">

first regex: https://regex101.com/r/jW9fT4/1

second regex: https://regex101.com/r/uZ7vX2/1
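
To try the two patterns above from Python rather than on regex101, something like the following should work. It is a minimal sketch, assuming Python 3 and the sample tag from the question; the variable names are only illustrative.

```python
import re

html = """<a class="button block left icon-phone" data-reveal="\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"  href="#">"""

# Lazy group: stops at the first closing quote, so only the attribute value is captured.
value_only = re.search(r'data-reveal="(.+?)"', html)

# Greedy group with no closing quote in the pattern: captures through the end of the line.
rest_of_line = re.search(r'data-reveal="(.+)', html)

# re.search returns None when nothing matches, so check before calling group().
if value_only and rest_of_line:
    print(value_only.group(1))
    print(rest_of_line.group(1))
```

The only difference between the two is the closing quote and the lazy quantifier, which is what keeps the first pattern from swallowing the href attribute.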

Keatinge
2

Try this:

import re

html = """<a class="button block left icon-phone" data-reveal="\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"  href="#">"""

# Capture the attribute value; [^"]* cannot run past the closing quote.
regexObj = re.compile(r'data-reveal="([^"]*)"')
result = regexObj.search(html)
print(result.group(1))

Output:

۰۹۳۶۵۶۸۱۶۲۱
Ren
  • Hi Ren, thanks a lot for your answer. Now when I print the output, it shows the escape sequences (\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1). How can I convert it to plain text that shows ۰۹۳۶۵۶۸۱۶۲۱? Thanks again. – alone Apr 11 '16 at 05:20
  • Thanks, I figured it out: I should use b'\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1' – alone Apr 11 '16 at 05:45
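
For the conversion asked about in the comments, one possible approach in Python 3 is to run the text through the unicode_escape codec. This is a sketch, assuming the extracted value really contains literal backslash-u sequences (as it would if it were read from a file or a web response rather than typed as a Python string literal) and nothing but ASCII characters.

```python
# Raw string: the backslashes are literal characters, as in text scraped from a page.
escaped = r"\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"

# The text is pure ASCII, so encode it to bytes and let the
# unicode_escape codec interpret each \uXXXX sequence.
decoded = escaped.encode("ascii").decode("unicode_escape")

print(decoded)
```

If the value may also contain non-ASCII characters, the encode step will raise UnicodeEncodeError, so this shortcut only fits fully escaped input.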