Handling Unicode characters of HTTP User-Agents in Python

Question

I am completely new to Python, but I found a package that I need to use and am testing it. The package in question is pywurfl.

I have written a simple script based on the package's example. It reads the User-Agent (UA) strings from a column in a tab-separated text file. There are a very large number of UAs, and some may contain foreign characters. The file containing the UAs was produced by a Perl script whose output was redirected with the shell's ">" operator, for example: perl somescript.pl > outfile.txt.

However, when I run the following code on that file, I get an error.

#!/usr/bin/python

import fileinput
import sys

from wurfl import devices
from pywurfl.algorithms import LevenshteinDistance


for line in fileinput.input():
    line = line.rstrip("\r\n")    # equiv of chomp
    H = line.split('\t')

    if H[27]=='Mobile':

        user_agent = H[23].decode('utf8')           
        search_algorithm = LevenshteinDistance()
        device = devices.select_ua(user_agent, search=search_algorithm)

        sys.stdout.write( "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % (user_agent, device.devid, device.devua, device.fall_back, device.actual_device_root, device.brand_name, device.marketing_name, device.model_name, device.device_os, device.device_os_version, device.mobile_browser, device.mobile_browser_version, device.model_extra_info, device.pointing_method, device.has_qwerty_keyboard, device.is_tablet, device.has_cellular_radio, device.max_data_rate, device.wifi, device.dual_orientation, device.physical_screen_height, device.physical_screen_width,device.resolution_height, device.resolution_width, device.full_flash_support, device.built_in_camera, device.built_in_recorder, device.receiver, device.sender, device.can_assign_phone_number, device.is_wireless_device, device.sms_enabled) + "\n")

    else:
        # do something else
        pass

Here H[23] is the column that holds the UA string, but I get an error that looks like this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte

When I replaced 'utf8' with 'latin1', I got the following error instead:

 sys.stdout.write(................) # with the .... as in the code
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128).

Am I doing anything wrong here? I need to convert the UA string to Unicode because the package requires it. I am not well versed in Unicode, especially in Python. How should I handle this error? For instance, how can I find the UA string that causes it, so that I can ask a more informed question?

score 2 · Accepted Answer · edited May 23 '17 at 11:51

It looks like you have two separate problems.

The first is that you're assuming the input file is UTF-8 when it's not. Changing the input encoding to latin-1 addresses that issue.
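For context on the first error: byte 0xa9 is the copyright sign in latin-1, but in UTF-8 it is only valid as a continuation byte, so it cannot start a character. To track down which UA strings trigger the error, something along these lines might help. It is only a sketch: the helper names decode_ua and find_undecodable are made up for illustration and are not part of pywurfl, and it assumes the lines are raw byte strings (which is what Python 2 gives you when reading a file).

```python
def decode_ua(raw):
    """Decode a raw byte string, trying UTF-8 first and falling back to latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every possible byte to a code point, so this cannot fail
        return raw.decode("latin-1")


def find_undecodable(lines, encoding="utf-8"):
    """Return (line_number, raw_line) pairs for lines that do not decode."""
    bad = []
    for number, raw in enumerate(lines, 1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            bad.append((number, raw))
    return bad
```

Running find_undecodable over the file's lines should point at the offending UAs. Note that the latin-1 fallback never raises, but it will silently produce wrong characters if the data is really in some other encoding.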

The second issue is that your stdout seems to be set up for ASCII output, so writing the non-ASCII character fails. For that, this question may help.
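One common way around the second error is to encode the text explicitly before writing it, rather than letting Python pick the ASCII codec. A minimal sketch, assuming UTF-8 is the output encoding you want and that every field is already decoded text or a simple value such as a number or boolean (encode_row and write_row are hypothetical helper names):

```python
import sys


def encode_row(fields, encoding="utf-8"):
    """Join the fields with tabs and return the row as encoded bytes."""
    text = u"\t".join(u"%s" % (field,) for field in fields) + u"\n"
    return text.encode(encoding)


def write_row(fields, encoding="utf-8"):
    """Write one encoded row to stdout as raw bytes."""
    # Python 3 exposes the byte stream as sys.stdout.buffer;
    # Python 2's sys.stdout accepts byte strings directly.
    out = getattr(sys.stdout, "buffer", sys.stdout)
    out.write(encode_row(fields, encoding))
```

In the loop from the question, the long sys.stdout.write(...) call could then become write_row([user_agent, device.devid, ...]) with the same list of attributes.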

answered Jan 07 '11 at 17:43

David Gelhar
