Right now my output to a file is like:

<b>Nov 22–24</b>   <b>Nov 29–Dec 1</b>    <b>Dec 6–8</b> <b>Dec 13–15</b>   <b>Dec 20–22</b>   <b>Dec 27–29</b>   <b>Jan 3–5</b> <b>Jan 10–12</b>   <b>Jan 17–19</b>   <b><i>Jan 17–20</i></b>    <b>Jan 24–26</b>   <b>Jan 31–Feb 2</b>    <b>Feb 7–9</b> <b>Feb 14–16</b>   <b><i>Feb 14–17</i></b>    <b>Feb 21–23</b>   <b>Feb 28–Mar 2</b>    <b>Mar 7–9</b> <b>Mar 14–16</b>   <b>Mar 21–23</b>   <b>Mar 28–30</b>   

I want to remove all the "Â" characters and HTML tags (<b>, </b>). I tried using the .remove and .replace functions but I get an error:

SyntaxError: Non-ASCII character '\xc2' in file -- FILE NAME-- on line 70, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

The output above is in a list, which I get from a webcrawling function:

def getWeekend(item_url):
    dates = []
    href = item_url[:37]+"page=weekend&" + item_url[37:]
    response = requests.get(href)
    soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
    date= soup.select('table.chart-wide > tr > td > nobr > font > a > b')
    return date

I write it to a file like so:

for item in listOfDate:
    wr.writerow(item)

How can I remove all the tags so that only the date is left?

alphamonkey

4 Answers


I'm not sure, but I think aString.regex_replace('toFind', 'toReplace') should work. Either that or write it to a file, and then run sed on it like: sed -i 's/toFind/toReplace/g'
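For reference, Python strings have no regex_replace method as far as I know; the usual regex substitution is re.sub from the standard library. A minimal sketch, assuming the tags are simple ones like <b> and <i> (strip_tags is a made-up helper name, and a regex like this is not a general HTML parser):

```python
import re

def strip_tags(text):
    # Hypothetical helper: drop anything that looks like <...>,
    # e.g. <b>, </b>, <i>. Good enough for simple markup only.
    return re.sub(r'<[^>]+>', '', text)

cleaned = strip_tags(u'<b><i>Jan 17\u201320</i></b>')
```

For anything beyond trivial markup, letting BeautifulSoup extract the text is probably the safer route.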

D Swartz

The problem is that you don't have an ASCII string from the website. You need to convert the non-ASCII text into something Python can understand before manipulating it.

Python will use Unicode when given a chance. There's plenty of information out there if you just have a look. For example, you can find more help from other questions on this website:

Python: Converting from ISO-8859-1/latin1 to UTF-8

python: unicode in Windows terminal, encoding used?

What is the difference between encode/decode?
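To illustrate where a stray "Â" can come from, here is a small sketch. It assumes (this is a guess about the page, not something confirmed in the question) that the site sends a UTF-8 non-breaking space, which is the two bytes C2 A0. Decoding those bytes with the wrong codec turns the first byte into "Â":

```python
# Sample bytes are made up for illustration:
# "Nov", a UTF-8 non-breaking space, "22", a UTF-8 en dash, "24".
raw = b'Nov\xc2\xa022\xe2\x80\x9324'

right = raw.decode('utf-8')    # bytes interpreted with the correct codec
wrong = raw.decode('latin-1')  # each byte becomes its own character

print(repr(right))
print(repr(wrong))
```

With the correct codec the non-breaking space is the single character u'\xa0'; with latin-1 it becomes u'\xc2' (Â) followed by u'\xa0'.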

Peter Brittain

You already got a working solution, but for the future:

  1. Use get_text() to get rid of the tags. soup.select() returns a list of tags, so call it on each one:

date = [b.get_text() for b in soup.select('table.chart-wide > tr > td > nobr > font > a > b')]

  2. Use .replace(u'\xc2', u'') to get rid of the Â. The u prefix makes u'\xc2' a unicode string. (This might take some futzing around with encoding, but for me get_text() is already returning a unicode object.)

(Additionally, possibly consider .replace(u'\u2013', u'-') because right now, you have an en-dash :P.)

date = [d.replace(u'\xc2', u'').replace(u'\u2013', u'-') for d in date]
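The two replacements can be wrapped in one small function applied to each scraped string. This is only a sketch: clean_date is a made-up name, and the extra u'\xa0' replacement is my assumption that the Â is the visible half of a mis-decoded non-breaking space, so drop that line if it doesn't match your data.

```python
def clean_date(text):
    """Normalize one scraped date string (hypothetical helper)."""
    return (text.replace(u'\xc2', u'')     # stray "Â" from a bad decode
                .replace(u'\xa0', u' ')    # assumed: non-breaking space -> plain space
                .replace(u'\u2013', u'-')  # en dash -> ASCII hyphen
                .strip())                  # trim leftover whitespace at the ends

# Sample strings made up to resemble the output in the question.
raw_dates = [u'Nov 22\u201324\xc2\xa0', u'Nov 29\u2013Dec 1\xc2\xa0']
dates = [clean_date(d) for d in raw_dates]
```

Since every item goes through the same function, the list can be passed to the CSV writer afterwards without touching the writing code.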

NightShadeQueen

If your Python 2 source code has literal non-ASCII characters such as Â, then you should declare the source code encoding as the error message says. Put at the top of your Python file:

# -*- coding: utf-8 -*-

Make sure the file is saved using the utf-8 encoding and use Unicode strings to work with the text.
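Put together, a file following this advice might look like the sketch below. The sample string is made up, and the declaration only takes effect when it sits on the first or second line of the real file:

```python
# -*- coding: utf-8 -*-
# With the declaration above, Python 2 accepts non-ASCII literals in this file.

text = u'Nov 22–24Â'               # Unicode literal containing an en dash and "Â"
cleaned = text.replace(u'Â', u'')  # both arguments are Unicode strings too
print(cleaned)
```

Keeping every string in the chain a Unicode string (the u prefix) avoids implicit ASCII conversions in Python 2.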

jfs
  • If you are a Vim user rather than an Emacs one, you can instead put near the top: ``# vim:set fileencoding=utf8:``. – bufh Jun 28 '15 at 09:41
  • @bufh: Python doesn't care as long as it matches [`"coding[:=]\s*([-\w.]+)"` regular expression](https://www.python.org/dev/peps/pep-0263/). – jfs Jun 28 '15 at 09:51