Convert in utf16

Question

I am crawling several websites and extracting the product names. Some of the names come out with errors like these:

Malecon 12 Jahre 0,05 ltr.<br>Reserva Superior
Bols Watermelon Lik\u00f6r 0,7l
Hayman\u00b4s Sloe Gin
Ron Zacapa Edici\u00f3n Negra
Havana Club A\u00f1ejo Especial
Caol Ila 13 Jahre (G&amp;M Discovery)

How can I fix that? I am using XPath and re.search to get the names.

Every Python file starts with this line: # -*- coding: utf-8 -*-

Edit:

This is the source code I use to get the information:

import re

if '"articleName":' in details:
    # Narrow the text down to the part between "articleName": and "imageTitle.
    closer_to_product = details.split('"articleName":', 1)[1]
    closer_to_product_2 = closer_to_product.split('"imageTitle', 1)[0]
    if debug_product == 1:
        print('product before try:' + repr(closer_to_product_2))
    try:
        # Capture the text between the first '"' and the following '",'.
        found_product = re.search(r'"(.*?)",', closer_to_product_2).group(1)
    except AttributeError:
        # re.search returned None: no quoted name was found.
        found_product = ''
    if debug_product == 1:
        print('cleared product: ', '>>>' + repr(found_product) + '<<<')
    if not found_product:
        print(product_detail_page, found_product)
        items['products'] = 'default'
    else:
        items['products'] = found_product

Details

product_details = information.xpath('/*').extract()
product_details = [details.strip() for details in product_details]

It depends on what you are using. :) Maybe a `.encode('utf-8')` will do – mama, Jul 09 '20 at 18:46
Does this answer your question? [How to convert a string to utf-8 in Python](https://stackoverflow.com/questions/4182603/how-to-convert-a-string-to-utf-8-in-python) – Red, Jul 09 '20 at 18:47
This is clearly not UTF-8. Any Unicode sequence that contains \u00 is invalid UTF-8. It is most certainly UTF-16-BE – JoelFan, Jul 09 '20 at 19:03
OK, dumb question, but would # -*- coding: utf-16 -*- help? – CIC3RO, Jul 09 '20 at 19:15
"# -*- coding: utf-8 -*-" is not useful here. It only declares the encoding of the source file and has no effect when the code executes (and UTF-8 is the default anyway, unless Python finds a strong hint, e.g. a BOM, that the file is encoded differently). Do not declare UTF-16 if your source is not UTF-16 (probably nobody uses UTF-16 for code). Second point: \uxxxx is not about UTF-16; it is just a representation of Unicode code points, independent of encoding. – Giacomo Catenazzi, Jul 10 '20 at 06:40
What is your text? Is it in some file? In the output of your program? How do you print the output? – Giacomo Catenazzi, Jul 10 '20 at 06:41
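
The point in the comment above about \uxxxx can be checked directly. A small sketch (the variable names are illustrative and not from the question) showing that the escape denotes a single code point, and that the bytes depend only on the encoding chosen afterwards:

```python
# In a normal string literal, '\u00f6' is just another way to write 'ö'.
name = 'Bols Watermelon Lik\u00f6r 0,7l'
print(name == 'Bols Watermelon Likör 0,7l')

# The same code point gives different bytes under different encodings.
utf8_bytes = 'ö'.encode('utf-8')
utf16_bytes = 'ö'.encode('utf-16-be')
print(utf8_bytes, utf16_bytes)

# In a raw string (or in scraped text) the backslash is a literal character,
# so the escape occupies six characters instead of one.
scraped = r'Lik\u00f6r'
print(len('Lik\u00f6r'), len(scraped))
```

If the scraped names look like the last case, the escapes were never interpreted, which is a decoding step and not a matter of the source file's coding declaration.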

JosefZ · Accepted Answer · 2021-01-04T22:24:06.173

Where is the problem (Python 3.8.3)?

import html

strings = [
  'Bols Watermelon Lik\u00f6r 0,7l',
  'Hayman\u00b4s Sloe Gin',
  'Ron Zacapa Edici\u00f3n Negra',
  'Havana Club A\u00f1ejo Especial',
  'Caol Ila 13 Jahre (G&amp;M Discovery)',
  'Old Pulteney \\u00b7 12 Years \\u00b7 40% vol',
  'Killepitsch Kr\\u00e4uterlik\\u00f6r 42% 0,7 L']
  
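for s in strings:  # 's' avoids shadowing the built-in str
    # Decode HTML entities first, then interpret literal \uXXXX escapes.
    print(html.unescape(s)
              .encode('raw_unicode_escape')
              .decode('unicode_escape'))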

Bols Watermelon Likör 0,7l
Hayman´s Sloe Gin
Ron Zacapa Edición Negra
Havana Club Añejo Especial
Caol Ila 13 Jahre (G&M Discovery)
Old Pulteney · 12 Years · 40% vol
Killepitsch Kräuterlikör 42% 0,7 L

Edit: Use .encode('raw_unicode_escape').decode('unicode_escape') to handle doubled backslashes (reverse solidi); see Python Specific Encodings.
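
For use inside a crawler, the same steps could be wrapped in a function. This is only a sketch: `clean_product_name` is a hypothetical name, not part of the original answer, and it assumes the only backslash sequences in the input are the unwanted escapes.

```python
import html


def clean_product_name(raw: str) -> str:
    """Decode HTML entities and literal backslash-u escapes (hypothetical helper)."""
    # Step 1: turn HTML entities such as '&amp;' into '&'.
    text = html.unescape(raw)
    # Step 2: encode to bytes; literal backslash escapes stay as they are,
    # and code points above U+00FF are written out as \uXXXX.
    as_bytes = text.encode('raw_unicode_escape')
    # Step 3: interpret the backslash escapes while decoding back to str.
    return as_bytes.decode('unicode_escape')


print(clean_product_name(r'Killepitsch Kr\u00e4uterlik\u00f6r 42% 0,7 L'))
print(clean_product_name('Caol Ila 13 Jahre (G&amp;M Discovery)'))
```

Note that `unicode_escape` interprets every backslash sequence, so a name containing a literal `\n` or `\t` would be altered as well.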

edited Jan 04 '21 at 22:24

answered Jul 09 '20 at 20:55

JosefZ

This solution helps with some of the issues, but I still get output like this: `Old Pulteney \\u00b7 12 Years \\u00b7 40% vol` or this: `Killepitsch Kr\\u00e4uterlik\\u00f6r 42% 0,7 L`. Is this because of the double backslash? – CIC3RO Jul 10 '20 at 07:03
@CIC3RO answer updated for _double_ backslash, see `strings[-2:]`… – JosefZ Jan 04 '21 at 22:25
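
The snippet in the question cuts the name out of what appears to be a JSON fragment (`"articleName": "..."`). Assuming the text between the quotes follows JSON string syntax (an assumption, since the page source is not shown), the standard `json` module can decode the escapes as well. `decode_json_string` is a hypothetical helper name:

```python
import json


def decode_json_string(fragment: str) -> str:
    """Decode the body of a JSON string literal (hypothetical helper).

    `fragment` is the text between the surrounding double quotes, as scraped.
    """
    # Re-wrap in quotes so json.loads sees a complete JSON string literal.
    return json.loads('"' + fragment + '"')


print(decode_json_string(r'Ron Zacapa Edici\u00f3n Negra'))
```

This raises `json.JSONDecodeError` if the fragment contains an unescaped double quote, and it does not touch HTML entities such as `&amp;`, which would still need `html.unescape`. A doubly escaped name would come out with a single backslash escape left and need a second pass.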
