I am crawling several websites and extracting the product names. Some of the names contain errors like these:

Malecon 12 Jahre 0,05 ltr.<br>Reserva Superior
Bols Watermelon Lik\u00f6r 0,7l
Hayman\u00b4s Sloe Gin
Ron Zacapa Edici\u00f3n Negra
Havana Club A\u00f1ejo Especial
Caol Ila 13 Jahre (G&amp;M Discovery)

How can I fix that? I am using xpath and re.search to get the names.

Every Python file starts with this line: # -*- coding: utf-8 -*-

Edit:

This is the source code I use to get the information:

if '"articleName":' in details:
    # Narrow the text down to the part between "articleName": and "imageTitle.
    closer_to_product = details.split('"articleName":', 1)[1]
    closer_to_product_2 = closer_to_product.split('"imageTitle', 1)[0]
    if debug_product == 1:
        print('product before try:' + repr(closer_to_product_2))
    try:
        # Capture the text between the opening quote and the closing '",'.
        found_product = re.search(r'"(.*?)",', closer_to_product_2).group(1)
    except AttributeError:
        # re.search returned None: no quoted name was found.
        found_product = ''
    if debug_product == 1:
        print('cleared product: ', '>>>' + repr(found_product) + '<<<')
    if not found_product:
        print(product_detail_page, found_product)
        items['products'] = 'default'
    else:
        items['products'] = found_product

Details

product_details = information.xpath('/*').extract()
product_details = [details.strip() for details in product_details]
CIC3RO
  • It depends on what you are using. :) maybe a `.encode('utf-8')` will do – mama Jul 09 '20 at 18:46
  • Does this answer your question? [How to convert a string to utf-8 in Python](https://stackoverflow.com/questions/4182603/how-to-convert-a-string-to-utf-8-in-python) – Red Jul 09 '20 at 18:47
  • This is clearly not UTF-8. Any Unicode sequence that contains \u00 is invalid UTF-8. It is most certainly UTF-16-BE – JoelFan Jul 09 '20 at 19:03
  • OK, dumb question, but would `# -*- coding: utf-16 -*-` help? – CIC3RO Jul 09 '20 at 19:15
  • The line `# -*- coding: utf-8 -*-` is not useful here. It only declares the encoding of the source file and has no effect when the code runs (UTF-8 is also the default unless Python finds a strong hint, such as a BOM, that the file is encoded differently). Do not declare UTF-16 if your code is not UTF-16 (probably nobody uses UTF-16 for code). Second point: \uxxxx has nothing to do with UTF-16; it is just a representation of Unicode code points, independent of encoding. – Giacomo Catenazzi Jul 10 '20 at 06:40
  • Where is your text? In a file? In the output of your program? How do you print the output? – Giacomo Catenazzi Jul 10 '20 at 06:41
  • The text is stored in a database. – CIC3RO Jul 10 '20 at 06:45

1 Answer

Where is the problem? The following works for me (Python 3.8.3):

import html

strings = [
  'Bols Watermelon Lik\u00f6r 0,7l',
  'Hayman\u00b4s Sloe Gin',
  'Ron Zacapa Edici\u00f3n Negra',
  'Havana Club A\u00f1ejo Especial',
  'Caol Ila 13 Jahre (G&amp;M Discovery)',
  'Old Pulteney \\u00b7 12 Years \\u00b7 40% vol',
  'Killepitsch Kr\\u00e4uterlik\\u00f6r 42% 0,7 L']
  
# Unescape HTML entities, then decode literal \uXXXX escape sequences.
for s in strings:
  print( html.unescape(s).
                encode('raw_unicode_escape').
                decode('unicode_escape') )

Output:

Bols Watermelon Likör 0,7l
Hayman´s Sloe Gin
Ron Zacapa Edición Negra
Havana Club Añejo Especial
Caol Ila 13 Jahre (G&M Discovery)
Old Pulteney · 12 Years · 40% vol
Killepitsch Kräuterlikör 42% 0,7 L

Edit: Use .encode('raw_unicode_escape').decode('unicode_escape') to handle doubled backslashes (reverse solidi); see Python Specific Encodings.
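
The same chain can be wrapped in a small helper so it is applied once, right after the name is extracted. This is only a sketch: the function name `clean_product_name` is made up for illustration, and replacing `<br>` with a space is an assumption about what the cleaned name should look like (the answer above does not cover `<br>`).

```python
import html
import re


def clean_product_name(raw: str) -> str:
    """Clean a scraped product name (hypothetical helper, not from the answer)."""
    # Assumption: a <br> tag inside a name should become a single space.
    text = re.sub(r'<br\s*/?>', ' ', raw, flags=re.IGNORECASE)
    # Turn HTML entities such as &amp; back into plain characters.
    text = html.unescape(text)
    # raw_unicode_escape leaves backslashes untouched, so a literal \u00f6
    # in the text survives as bytes; unicode_escape then interprets it.
    return text.encode('raw_unicode_escape').decode('unicode_escape')


print(clean_product_name('Caol Ila 13 Jahre (G&amp;M Discovery)'))
print(clean_product_name('Killepitsch Kr\\u00e4uterlik\\u00f6r 42% 0,7 L'))
print(clean_product_name('Malecon 12 Jahre 0,05 ltr.<br>Reserva Superior'))
```

Note that `unicode_escape` interprets every backslash sequence it recognises, not only `\uXXXX`, so a name containing a literal `\n` or `\t` would be changed as well.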

JosefZ
  • This solution helps with some of the issues, but I still get output like this: `Old Pulteney \\u00b7 12 Years \\u00b7 40% vol` or this: `Killepitsch Kr\\u00e4uterlik\\u00f6r 42% 0,7 L`. Is this because of the double backslash? – CIC3RO Jul 10 '20 at 07:03
  • @CIC3RO answer updated for _double_ backslash, see `strings[-2:]`… – JosefZ Jan 04 '21 at 22:25
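
A different technique from the one in the answer, offered only as a sketch: the question pulls the name out of text that looks like JSON (`"articleName": "..."`), so the captured fragment could possibly be decoded with `json.loads` instead of the codec chain. This assumes the text between the quotes is the body of a valid JSON string literal; the function name `decode_json_fragment` is hypothetical.

```python
import html
import json


def decode_json_fragment(fragment: str) -> str:
    """Decode the body of a JSON string literal (hypothetical helper)."""
    # Assumption: `fragment` is what sat between the double quotes in the
    # page source, so adding the quotes back gives a valid JSON string.
    decoded = json.loads('"' + fragment + '"')
    # JSON decoding handles \uXXXX; HTML entities still need unescaping.
    return html.unescape(decoded)


print(decode_json_fragment('Killepitsch Kr\\u00e4uterlik\\u00f6r 42% 0,7 L'))
print(decode_json_fragment('Caol Ila 13 Jahre (G&amp;M Discovery)'))
```

If the fragment is not valid JSON (for example, it contains a bare backslash that is not part of an escape), `json.loads` raises `json.JSONDecodeError`, so the call may need a `try`/`except` around it.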