How to use the re module in Python to extract information?

Question

I wrote a small script that uses the Collins website for translation. Here's my code:

import urllib.request
import re
def translate(search):
    base_url = 'http://www.collinsdictionary.com/dictionary/american/'
    url = base_url + search
    p = urllib.request.urlopen(url).read()
    f = open('t.txt', 'w+b')
    f.write(p)
    f.close()
    f = open('t.txt', 'r')
    t = f.read()
    m = re.search(r'(<span class="def">)(\w.*)(</span>]*)',t)
    n = m.group(2)
    print(n)
    f.close()

I have some questions:

I can't use re.search on p; it raises TypeError: can't use a string pattern on a bytes-like object. Is there a way to use re.search without saving the page to a file first?
After saving the file, I have to reopen it before I can use re.search; otherwise it raises TypeError: must be str, not bytes. Why does this error happen?
I want to extract the text between <span class="def"> and </span> from the first match, but the pattern I wrote doesn't work well in all cases. For example, translate('three') is fine; the output is "totaling one more than two". But for translate('tree') the output is "a treelike bush or shrub â‡’ a rose tree". Is there a way to correct this pattern, using a regular expression or any other tool?

score 0 · Accepted Answer · edited May 23 '17 at 12:22

When you call read on the response returned by urllib, you get a bytes object, which you need to decode to convert it to a string.

Change

    p=urllib.request.urlopen(url).read()

to

    p=urllib.request.urlopen(url).read().decode('utf-8')

You should read the Unicode HOWTO at https://docs.python.org/3/howto/unicode.html to understand why; issues like this come up a lot.
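To see the difference without hitting the network, here is a minimal sketch. The raw value is a made-up stand-in for what urlopen(url).read() returns, and the non-greedy pattern is only an illustration, not a pattern verified against the real Collins markup. Once the bytes are decoded, the temporary file is no longer needed.

```python
import re

# Hypothetical stand-in for the bytes returned by urlopen(url).read()
raw = '<span class="def">totaling one more than two</span>'.encode('utf-8')

# Applying a str pattern to bytes raises TypeError
try:
    re.search(r'<span class="def">', raw)
    raised = False
except TypeError:
    raised = True

# Decode once, then search the resulting str directly (no temp file)
text = raw.decode('utf-8')
m = re.search(r'<span class="def">(.*?)</span>', text)
definition = m.group(1) if m else None
print(definition)
```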

Also, you probably don't want to parse HTML using regex. Some better alternatives for Python are mentioned here.
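As one possible illustration of parsing instead of pattern matching, here is a rough sketch using html.parser from the standard library. The sample markup, the nested "example" class name, and the FirstDefParser and first_definition names are all assumptions made up for this example; the real Collins page may be structured differently. The idea is to collect only the text that sits directly inside the first <span class="def"> and skip anything in nested spans.

```python
from html.parser import HTMLParser


class FirstDefParser(HTMLParser):
    """Collects text directly inside the first <span class="def">."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # span nesting depth, counted from the target span
        self.done = False  # set once the first target span has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag != 'span':
            return
        if self.depth > 0:
            self.depth += 1  # a span nested inside the target
        elif 'def' in (dict(attrs).get('class') or '').split():
            self.depth = 1   # entering the target span

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        # Keep only text at the top level of the target span
        if self.depth == 1 and not self.done:
            self.parts.append(data)


def first_definition(html):
    parser = FirstDefParser()
    parser.feed(html)
    return ''.join(parser.parts).strip()


# Hypothetical markup; the "example" class is an assumption
sample = ('<span class="def">a treelike bush or shrub '
          '<span class="example"> ⇒ a rose tree</span></span>'
          '<span class="def">another sense</span>')
print(first_definition(sample))
```

If the example text turns out not to be wrapped in its own element on the real page, this approach would need adjusting, for instance by cutting the text at the arrow character.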
