
I'm struggling with the .csv output below: various random characters appear within the players' names and values where there shouldn't be any.

(I've given a picture of the output below.)

I'm wondering where I'm going wrong in the code, as I can't eliminate the random characters.

I'm trying to remove characters such as Â, Ã, ©, ‰ and so on. Any suggestions?

Python Code

#importing

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent':
       'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like '
       'Gecko) Chrome/47.0.2526.106 Safari/537.36'}

#calling websites
page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

#calling players names
Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
#Let's look at the first name in the Players list.
Players[0].text

#calling value of players
Values = pageSoup.find_all("td", {"class": "rechts hauptlink"})
#Let's look at the first value in the Values list.
Values[0].text

PlayersList = []
ValuesList = []

for i in range(0,25):
   PlayersList.append(Players[i].text)
   ValuesList.append(Values[i].text)

df = pd.DataFrame({"Players":PlayersList,"Values":ValuesList})

df.to_csv('2000.csv', index=False)

df.head()

====================================================================

My Excel output

[Image: The Output]

  • Have you tried `pageTree.text` instead of `pageTree.content` so that the encoding is automatically handled? Might find that'll sort everything out for you. – Jon Clements Dec 15 '18 at 20:36
  • Those characters are clearly UTF-8 characters (like é => Ã©) seen from an ISO-8859-1 or Windows codepage 1252 context. The output is probably correct, it's the system you are viewing it from that is not set to UTF-8. – Walter Tross Dec 15 '18 at 20:39
  • oh, I have to correct myself. The problem is Excel: https://stackoverflow.com/questions/6002256/is-it-possible-to-force-excel-recognize-utf-8-csv-files-automatically – Walter Tross Dec 15 '18 at 20:44
  • @JonClements Hey Jon, tried it out and it doesn't change the characters – Kevin Carmody Dec 15 '18 at 20:53
  • @WalterTross thank you for the link, I changed the encoding in Excel after importing and it worked – Kevin Carmody Dec 15 '18 at 20:56
  • @Kevin might just want to use `df.to_excel(...)` directly if you don't require CSVs... that way you also get nice goodies such as types preserved and other formatting etc... – Jon Clements Dec 15 '18 at 20:58
  • @Kevin I cannot try out my solution right now, but I'm pretty sure it works. Have you tried it out? – Walter Tross Dec 15 '18 at 21:06
  • @WalterTross Hi Walter, yes, it's working perfectly now. Many thanks – Kevin Carmody Dec 15 '18 at 21:08
  • @KevinCarmody IF what is working now is the code suggested by my answer, may I ask you to accept it? Upvoting and accepting is what makes StackOverflow tick. – Walter Tross Dec 15 '18 at 21:22

3 Answers

...
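# The BOM is the single code point U+FEFF; encoded as UTF-8 it becomes the bytes EF BB BF
utf8_bom = '\ufeff'
# Open the file explicitly as UTF-8 so the result does not depend on the locale
with open('2000.csv', 'w', encoding='utf-8', newline='') as csv_file:
    csv_file.write(utf8_bom)
    df.to_csv(csv_file, index=False)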

Explanation: The BOM is the byte order mark (q.v.). If Excel finds it at the beginning of the CSV file, it uses it to determine the encoding, which in your case is UTF-8 (the default encoding – correctly – for Python 3).
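
To check whether a given CSV file actually starts with the BOM, a small helper like the one below can be used. This is only a sketch: `has_utf8_bom`, the temporary file names and the sample row are made up for illustration and are not part of the original code.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 BOM bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


# Hypothetical sample files, written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.csv")
bom_path = os.path.join(tmp_dir, "with_bom.csv")

sample = "Players\nLuís Figo\n"  # sample data, assumed for the demo

# UTF-8 without a BOM.
with open(plain_path, "w", encoding="utf-8", newline="") as f:
    f.write(sample)

# UTF-8 with the BOM (U+FEFF) written first.
with open(bom_path, "w", encoding="utf-8", newline="") as f:
    f.write("\ufeff")
    f.write(sample)

print(has_utf8_bom(plain_path))  # False
print(has_utf8_bom(bom_path))    # True
```

Reading the file in binary mode matters here: in text mode the BOM may already have been decoded or stripped, depending on the encoding passed to `open`.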


EDIT

As Mark Tolonen pointed out, the compact version of the above is the following code:

df.to_csv('2000.csv', encoding='utf-8-sig', index=False)

The -sig in the name of the encoding stands for “signature”, i.e., the BOM at the beginning which is used by Microsoft software to detect the encoding. See also the Encodings and Unicode section of the codecs manual.
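
The difference between the two codecs can be seen without pandas, using only the standard library. A minimal sketch (the variable names are arbitrary):

```python
import codecs

text = "é"

plain = text.encode("utf-8")       # no signature
signed = text.encode("utf-8-sig")  # signature (BOM) prepended

# utf-8-sig is plain UTF-8 with the three BOM bytes EF BB BF in front.
print(signed == codecs.BOM_UTF8 + plain)  # True

# Decoding with utf-8-sig strips the BOM again ...
print(signed.decode("utf-8-sig") == text)  # True

# ... while plain utf-8 keeps it as the character U+FEFF.
print(signed.decode("utf-8") == "\ufeff" + text)  # True

# utf-8-sig also accepts input that has no BOM at all.
print(plain.decode("utf-8-sig") == text)  # True
```

So `utf-8-sig` is a safe choice for reading as well: it works whether or not the file carries the signature.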

Walter Tross

Your system seems to be writing the file encoded as UTF-8. Excel expects UTF-8 files to have a BOM signature, else it assumes a text file is encoded in a locale-specific ANSI encoding. This is for backward compatibility due to Windows existing before UTF-8 did.
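
The effect can be reproduced directly in Python. The sketch below assumes the ANSI encoding is Windows code page 1252 (typical for Western European locales, but an assumption here), and the name is just a sample value:

```python
name = "José"  # sample value, assumed for the demo

# What ends up in the file: the UTF-8 bytes of the name.
utf8_bytes = name.encode("utf-8")

# What a reader sees when it interprets those bytes as cp1252 instead.
as_seen_by_excel = utf8_bytes.decode("cp1252")
print(as_seen_by_excel)  # JosÃ©

# Nothing is lost: reversing the wrong decoding recovers the original text.
recovered = as_seen_by_excel.encode("cp1252").decode("utf-8")
print(recovered == name)  # True
```

The two bytes that UTF-8 uses for "é" (C3 A9) are shown as the two separate cp1252 characters "Ã" and "©", which matches the stray characters in the question.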

Python has an encoding that writes the UTF-8 BOM signature, utf-8-sig, so simply use:

df.to_csv('2000.csv', encoding='utf-8-sig', index=False)
Mark Tolonen

UPDATE:

I fixed this using the answer linked below:

https://stackoverflow.com/a/6488070/10675615

  1. Save the exported file as a CSV (from the command prompt)
  2. Open Excel
  3. Import the data using Data --> Import External Data / Get Text/CSV --> Import Data
  4. Select the file type of "csv" and browse to your file
  5. In the import wizard, change the File Origin to "65001 UTF" (or choose the correct language character identifier)
  6. Change the Delimiter to comma
  7. Select where to import to and click Finish

This way the special characters should show correctly.
