I wrote a small script that uses the Collins website for translation. Here's my code:

    import urllib.request
    import re
    def translate(search):
        base_url = 'http://www.collinsdictionary.com/dictionary/american/'
        url = base_url + search
        p = urllib.request.urlopen(url).read()
        f = open('t.txt', 'w+b')
        f.write(p)
        f.close()
        f = open('t.txt', 'r')
        t = f.read()
        m = re.search(r'(<span class="def">)(\w.*)(</span>]*)',t)
        n = m.group(2)
        print(n)
        f.close()

I have some questions:

  1. I can't use re.search on p. It raises this error: TypeError: can't use a string pattern on a bytes-like object. Is there a way to use re.search without saving the data to a file first?

  2. After saving the file, I have to reopen it before using re.search; otherwise it raises this error: TypeError: must be str, not bytes. Why does this error happen?

  3. In this program I want to extract the text between <span class="def"> and </span> from the first match, but the pattern I wrote does not work well in all cases. For example, translate('three') is fine; the output is "totaling one more than two". But for translate('tree') the output is "a treelike bush or shrub   ⇒ a rose tree". Is there a way to correct this pattern, using a regular expression or any other tool?

Sara Santana

1 Answer

When you call read on the response returned by urllib, you get a bytes object, which you need to decode to convert it to a string.

Change

    p=urllib.request.urlopen(url).read()

to

    p=urllib.request.urlopen(url).read().decode('utf-8')

You should read https://docs.python.org/3/howto/unicode.html to understand why; issues like this come up a lot.
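As a rough illustration of the bytes versus str distinction, here is a minimal sketch that uses a hard-coded bytes value in place of the real response body. The sample markup and the variable names are assumptions for illustration, not the actual Collins page. It also uses a non-greedy `(.*?)` group, which stops at the first closing tag rather than the last one on the line.

```python
import re

# Stand-in for urllib.request.urlopen(url).read(); illustrative markup only.
raw = b'<span class="def">totaling one more than two</span>'

# Option 1: decode the bytes, then search with an ordinary str pattern.
text = raw.decode('utf-8')
match = re.search(r'<span class="def">(.*?)</span>', text)
definition = match.group(1) if match else None

# Option 2: keep the bytes and use a bytes pattern, decoding only the result.
bytes_match = re.search(rb'<span class="def">(.*?)</span>', raw)
definition_from_bytes = bytes_match.group(1).decode('utf-8') if bytes_match else None

print(definition)
print(definition_from_bytes)
```

Either way, no temporary file is needed: the pattern and the data just have to be the same type.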

Also, you probably don't want to parse HTML using regex. Some better alternatives for Python are mentioned here.
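One possible option, which may or may not be among the alternatives referred to above, is the standard library's `html.parser` module. The sketch below collects the text of the first `<span class="def">` element, including text inside nested spans. `FirstDefParser` and `first_definition` are hypothetical names, and the sample markup is an assumption about the page structure, not the real Collins HTML.

```python
from html.parser import HTMLParser


class FirstDefParser(HTMLParser):
    """Collect the text of the first <span class="def"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside the target span; counts nested spans
        self.done = False   # set once the first target span has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag != 'span':
            return
        if self.depth:
            self.depth += 1  # a span nested inside the target span
        elif 'def' in (dict(attrs).get('class') or '').split():
            self.depth = 1   # entering the first target span

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)


def first_definition(html):
    """Return the whitespace-normalized text of the first def span, or None."""
    parser = FirstDefParser()
    parser.feed(html)
    return ' '.join(''.join(parser.parts).split()) or None


# Illustrative markup only; the real page may be structured differently.
sample = ('<div><span class="def">a treelike bush or shrub '
          '<span class="ex">a rose tree</span></span>'
          '<span class="def">another sense</span></div>')
print(first_definition(sample))
```

Because the parser tracks nesting depth, it stops at the closing tag that actually matches the first `<span class="def">`, which a single regular expression cannot do reliably.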

Andrew Magee