I am completely new to Python, but I found a package that I need to use and am testing it. The package in question is pywurfl.

I have written a simple script, based on the package's example, that reads user-agent (UA) strings from a column in a simple text file. There is a very large number of UAs, and some may contain foreign characters. The file containing the UAs was produced by a Perl script whose output was redirected with the shell's ">" operator, for example: perl somescript.pl > outfile.txt.

However, when I run the following code on that file, I get an error.

#!/usr/bin/python

import fileinput
import sys

from wurfl import devices
from pywurfl.algorithms import LevenshteinDistance


for line in fileinput.input():
    line = line.rstrip("\r\n")    # equiv of chomp
    H = line.split('\t')

    if H[27]=='Mobile':

        user_agent = H[23].decode('utf8')           
        search_algorithm = LevenshteinDistance()
        device = devices.select_ua(user_agent, search=search_algorithm)

        sys.stdout.write( "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % (user_agent, device.devid, device.devua, device.fall_back, device.actual_device_root, device.brand_name, device.marketing_name, device.model_name, device.device_os, device.device_os_version, device.mobile_browser, device.mobile_browser_version, device.model_extra_info, device.pointing_method, device.has_qwerty_keyboard, device.is_tablet, device.has_cellular_radio, device.max_data_rate, device.wifi, device.dual_orientation, device.physical_screen_height, device.physical_screen_width,device.resolution_height, device.resolution_width, device.full_flash_support, device.built_in_camera, device.built_in_recorder, device.receiver, device.sender, device.can_assign_phone_number, device.is_wireless_device, device.sms_enabled) + "\n")

    else:
        # do something else
        pass

Here H[23] is the column that holds the UA string, but I get an error that looks like this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte

When I replaced 'utf8' with 'latin1', I got the following error instead:

 sys.stdout.write(................) # with the .... as in the code
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128).

Am I doing anything wrong here? I need to convert the UA string to Unicode because the package expects Unicode input. I am not well versed in Unicode, especially in Python. How should I handle this error? For instance, how can I find the UA string that triggers it, so that I can ask a more informed question?

sfactor

1 Answer


It looks like you have two separate problems.

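The first is that you're assuming the input file is UTF-8 when it isn't: byte 0xa9 is only valid in UTF-8 as a continuation byte, so it cannot start a character. Changing the input encoding to Latin-1, where 0xa9 is the copyright sign, addresses that issue.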
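To track down which UA strings trigger the decode error, something along these lines might help. It is only a sketch: `find_undecodable` and the sample UA strings are made up for illustration, and it assumes the lines are available as raw byte strings, as they are in the Python 2 script above.

```python
def find_undecodable(lines, encoding="utf-8"):
    """Return (line_number, raw_bytes) for every line that fails to decode."""
    bad = []
    for number, raw in enumerate(lines, 1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            bad.append((number, raw))
    return bad


# Hypothetical sample input; the second entry starts with byte 0xa9.
sample = [b"Mozilla/5.0 (Linux)", b"\xa9 SomeVendor Browser"]

bad_lines = find_undecodable(sample)

# Latin-1 maps every byte to a code point, so this decode cannot fail.
as_latin1 = sample[1].decode("latin-1")
```

Printing `repr()` of each entry in `bad_lines` shows the offending bytes without triggering another encoding error.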

The second issue is that your stdout seems to be set up for ASCII output, so the write fails as soon as it reaches the non-ASCII character u'\xa9'. For that, this question may help.
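One common way around the ASCII default is to wrap the output stream in a writer that encodes to UTF-8 explicitly. The sketch below assumes UTF-8 is the output encoding you want; `utf8_writer` is a hypothetical helper name, and an in-memory buffer stands in for stdout so the result can be inspected.

```python
import codecs
import io


def utf8_writer(byte_stream):
    """Wrap a byte stream so that Unicode text written to it is UTF-8 encoded."""
    return codecs.getwriter("utf-8")(byte_stream)


# In-memory stand-in for stdout. In a Python 2 script like the one in the
# question you would pass sys.stdout; in Python 3, sys.stdout.buffer.
buffer = io.BytesIO()
out = utf8_writer(buffer)
out.write(u"\xa9 SomeVendor Browser\n")

written = buffer.getvalue()
```

The wrapper encodes each write, so the copyright sign comes out as the two bytes 0xc2 0xa9 rather than raising UnicodeEncodeError.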

David Gelhar