In my student data I am getting weird symbols, as you can see: MAIN Â£1.00 when it should show MAIN £1.00

Below is a snippet of my code that scrapes a website for certain student information for their student discounts and eventually writes it to a file.

# -*- coding: utf-8 -*-
import re

# `main` and `rpr_data` are defined earlier in the script (not shown here).
totals = main.find_all('p')
for total in totals:
    if total.find(text=re.compile("Main:")):
        total = total.get_text()  # text content of the paragraph
        if u"Main £" in total:
            pull1 = re.search(r'(MAIN) (\D\w+\D\d+)', total)
            pull2 = re.search(r'(MAINER) (\D\w+\D\d+)', total)
            if pull1:
                rpr_data.append(pull1.group(0).title())
                print pull1.group(0).title()
            if pull2:
                rpr_data.append(pull2.group(0).title())
                print pull2.group(0).title()
with open('RPR.txt', 'w') as rpr_file:
    # Encode the joined text to UTF-8 bytes before writing (Python 2).
    rpr_file.write('\n'.join(rpr_data).encode("UTF-8"))

When I try to re-use this data in the script from "Matching three variables from textfile to csv and writing variables to the csv on matched rows", the symbol comes back when it writes to the CSV, even though the data in the text file has no weird Â symbol...

How can I permanently eradicate this Â symbol correctly?

Ryflex
  • First, are you sure the script is actually saved as a UTF-8 text file, rather than Latin-1/cp1252/etc.? (Just putting a coding declaration comment at the top doesn't change the coding your text editor uses, except for emacs; it just lies to Python…) – abarnert Oct 09 '13 at 23:05
  • Also, _where_ are you seeing that `MAIN Â£1.00`? In Notepad.exe after opening `RPR.txt`? Or…? – abarnert Oct 09 '13 at 23:07
  • @abarnert I am seeing that on certain prints. Using the line `total = total.get_text()` somehow strips the `Â` symbol but after I run the matching script in the other thread I get the `Â` re-appear in the column(s) – Ryflex Oct 09 '13 at 23:12
  • Ah, that's because your terminal's character set is Latin-1/cp1252/etc., and Python is printing UTF-8 at it. (Can you please tell us what platform you're on and, if not Windows, what terminal you use, so I don't have to keep guessing?) – abarnert Oct 09 '13 at 23:13

1 Answer

Getting extra Â characters before various western-european characters is almost always a sign of interpreting UTF-8 as Latin-1 (or cp1252 or some other "extended Latin-1" charset).*

That could be you receiving UTF-8 input and trying to process it as Latin-1, or you generating UTF-8 output that someone else is trying to process as Latin-1.


If you're seeing these in the output file, the most likely possibility is that your code is doing everything right every step of the way, and generating a perfectly good UTF-8 file… and then you're trying to view that file on a Windows machine whose ANSI code page is 1252 in a program like Notepad that defaults to the ANSI code page.

If that's it, there are two possibilities:

  1. Don't do that. View the file as UTF-8. You can tell Notepad to open a file as UTF-8 instead of the default. Or you can use a different editor/viewer.

  2. If you want the file to be viewable as cp1252, or as "whatever the ANSI code page is on this machine", save it that way; e.g., change the last line to use encode("cp1252").
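
The two options can be sketched as follows. This is a minimal illustration, not the asker's actual script: the sample rows, the `write_lines` helper, and the temporary file names are all made up for the example, and `io.open` is used so the same code runs on Python 2.7 and 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample rows standing in for the scraped rpr_data list.
rpr_data = [u'Main \u00a31.00', u'Mainer \u00a32.50']


def write_lines(path, lines, encoding):
    """Write unicode lines to `path` using an explicit encoding."""
    # newline='\n' disables platform newline translation so the bytes are predictable.
    with io.open(path, 'w', encoding=encoding, newline='\n') as handle:
        handle.write(u'\n'.join(lines))


out_dir = tempfile.mkdtemp()
utf8_path = os.path.join(out_dir, 'RPR_utf8.txt')
cp1252_path = os.path.join(out_dir, 'RPR_cp1252.txt')

write_lines(utf8_path, rpr_data, 'utf-8')     # option 1: open this one as UTF-8
write_lines(cp1252_path, rpr_data, 'cp1252')  # option 2: matches a cp1252 viewer

with io.open(utf8_path, 'rb') as handle:
    utf8_bytes = handle.read()
with io.open(cp1252_path, 'rb') as handle:
    cp1252_bytes = handle.read()

# What a cp1252 viewer would show for each file.
utf8_seen_as_cp1252 = utf8_bytes.decode('cp1252')
cp1252_seen_as_cp1252 = cp1252_bytes.decode('cp1252')
print(repr(utf8_seen_as_cp1252))
print(repr(cp1252_seen_as_cp1252))
```

Either file is "correct"; the only question is whether the program reading it assumes the same encoding that was used to write it.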


If you're seeing them in the print statements, the most likely possibility is that your code is doing everything right, but your terminal is a Windows DOS prompt that's again set to code page 1252. See "Python, Unicode, and the Windows console" and "Windows cmd encoding change causes Python crash" for all the different things that can be wrong here and how to work around them.
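
As a rough first diagnostic for the printing case, you can check which encoding Python thinks the console uses and make sure text survives a round trip through it. This is only a sketch under assumptions: `console_safe` is a hypothetical helper, not something from the linked questions, and it falls back to UTF-8 when `sys.stdout.encoding` is unavailable.

```python
# -*- coding: utf-8 -*-
import sys


def console_safe(text, encoding=None):
    """Round-trip unicode `text` through `encoding`, replacing unencodable characters."""
    # Fall back to the console's reported encoding, then to UTF-8 (an assumption).
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding, errors='replace').decode(encoding)


# The encoding Python believes the console is using (may be None when redirected).
print(getattr(sys.stdout, 'encoding', None))
print(console_safe(u'Main \u00a31.00'))
```

The point of the round trip is that Python does the encoding for the console itself, instead of UTF-8 bytes being sent to a terminal that expects something else.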


* You can see this from a quick line of Python: u'\u00a3'.encode('utf-8').decode('latin-1') == u'\u00c2\u00a3'. That u'\u00c2' is Â. Going the other way can never cause this problem: u'\u00a3'.encode('latin-1').decode('utf-8') will instead raise a UnicodeDecodeError.
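
The footnote's one-liners, spelled out as a small runnable sketch (the variable names are only for illustration):

```python
# -*- coding: utf-8 -*-
pound = u'\u00a3'                        # the pound sign as a unicode string

# UTF-8 bytes misread as Latin-1: two bytes become two characters.
utf8_bytes = pound.encode('utf-8')
misread = utf8_bytes.decode('latin-1')
print(repr(misread))

# Latin-1 bytes misread as UTF-8: the lone byte is not valid UTF-8.
latin1_bytes = pound.encode('latin-1')
try:
    latin1_bytes.decode('utf-8')
    reverse_failed = False
except UnicodeDecodeError:
    reverse_failed = True
print(reverse_failed)
```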

abarnert
  • The original encoding is: `iso-8859-1` – Ryflex Oct 09 '13 at 23:10
  • @Hyflex: The original encoding of _what_? The source code file? Some input file that you're not showing us here? Anyway, `iso-8859-1` is the same thing as Latin-1. – abarnert Oct 09 '13 at 23:12
  • The website in which some of my data is scraped from in the source it says `` – Ryflex Oct 09 '13 at 23:14
  • Latin-1 indeed. `>>> print "MAIN £1.00".decode('latin-1') -> MAIN £1.00` – Igonato Oct 09 '13 at 23:14
  • The data is pulled directly from urllib using the following line: `html = urllib2.urlopen("xx.xx.xxx.xx", timeout=10).read().decode('Latin-1')` (the web-address can't be shown as it's a direct IP) but if I print the html it still shows the weird A – Ryflex Oct 09 '13 at 23:17
  • @Hyflex: This may seem surprising, but that's probably a red herring. You haven't shown us how you scraped the data, but you must have properly accounted for the Latin-1 on that end. Interpreting a UTF-8 `£` as Latin-1 will give you `Â£`, but interpreting a Latin-1 `£` as UTF-8 will give you a `UnicodeDecodeError`. – abarnert Oct 09 '13 at 23:17
  • @Hyflex: Beat me by a millisecond… Anyway, yeah, that proves that you _are_ properly accounting for the Latin-1 on the input end. The problem is that you aren't properly accounting for the Latin-1-or-similar on the _output_ end. – abarnert Oct 09 '13 at 23:18
  • @abarnet I've changed them all to Latin-1 but if I print `total` before the line `total = total.get_text()` I get the `Â£` everywhere still... :/ – Ryflex Oct 09 '13 at 23:22
  • @Hyflex: You've changed _what_ all to Latin-1? Meanwhile, if the problem is on printing, did you read the questions/answers I linked to? There's way too much to explain in a comment. – abarnert Oct 09 '13 at 23:26
  • @Hyflex: Also, if you won't answer questions, it makes it very hard to help you. I asked you before for your platform, terminal, versions, etc. Without any of that information, I have to try to guess, and to write vague answers that make sense for any possibility. – abarnert Oct 09 '13 at 23:29
  • Windows, IdleGUI Python 2.7... I don't know exactly what else you need. – Ryflex Oct 10 '13 at 03:03