Replacing all instances of string in string Python

Question

Right now my output to a file is like:

<b>Nov 22Â–24</b>   <b>Nov 29Â–Dec 1</b>    <b>Dec 6Â–8</b> <b>Dec 13Â–15</b>   <b>Dec 20Â–22</b>   <b>Dec 27Â–29</b>   <b>Jan 3Â–5</b> <b>Jan 10Â–12</b>   <b>Jan 17Â–19</b>   <b><i>Jan 17Â–20</i></b>    <b>Jan 24Â–26</b>   <b>Jan 31Â–Feb 2</b>    <b>Feb 7Â–9</b> <b>Feb 14Â–16</b>   <b><i>Feb 14Â–17</i></b>    <b>Feb 21Â–23</b>   <b>Feb 28Â–Mar 2</b>    <b>Mar 7Â–9</b> <b>Mar 14Â–16</b>   <b>Mar 21Â–23</b>   <b>Mar 28Â–30</b>

I want to remove all the "Â" characters and the HTML tags (<b>, </b>). I tried using the .remove and .replace functions but I get an error:

SyntaxError: Non-ASCII character '\xc2' in file -- FILE NAME-- on line 70, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

The output above is in a list, which I get from a webcrawling function:

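import requests
from bs4 import BeautifulSoup

def getWeekend(item_url):
    # Insert the "page=weekend&" query parameter at character 37 of the URL.
    href = item_url[:37] + "page=weekend&" + item_url[37:]
    response = requests.get(href)
    soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
    # select() returns a list of <b> Tag objects, not plain strings.
    date = soup.select('table.chart-wide > tr > td > nobr > font > a > b')
    return date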

I write it to a file like so:

for item in listOfDate:
    wr.writerow(item)

How can I remove all the tags so that only the date is left?

What is the page encoding? – Padraic Cunningham Jun 27 '15 at 22:23

score 2 · Answer 1 · answered Jun 27 '15 at 21:48

I'm not sure, but I think aString.replace('toFind', 'toReplace') should work (or re.sub() if you need a pattern rather than a fixed string). Either that or write it to a file, and then run sed on it like: sed -i 's/toFind/toReplace/g'
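
For the in-Python route, here is a minimal sketch of the two standard tools: str.replace for fixed substrings and re.sub for patterns. The helper name strip_tags and the sample string are made up for illustration, and the regex is a crude tag stripper, not a real HTML parser.

```python
import re

def strip_tags(text):
    """Remove anything that looks like an HTML tag, e.g. <b> or </i>."""
    return re.sub(r"</?[A-Za-z][^>]*>", "", text)

# Hypothetical sample, shaped like the output in the question.
raw = "<b>Nov 22-24</b> <b><i>Jan 17-20</i></b>"

# str.replace only removes the exact substrings it is given.
no_bold = raw.replace("<b>", "").replace("</b>", "")

# re.sub removes every tag that matches the pattern.
plain = strip_tags(raw)
```

Note that str.replace returns a new string, so the result has to be assigned to something; the original string is never changed in place.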

answered Jun 27 '15 at 21:48

D Swartz

Thanks, I'll just use the Excel find-and-replace function; I just tried it and it's super easy. – alphamonkey Jun 27 '15 at 21:52

score 1 · Answer 2 · edited May 23 '17 at 12:14

The problem is that the text you get from the website is not ASCII. You need to decode it into a Unicode string, using the page's actual encoding, before manipulating it.

Python handles Unicode text well once it has been decoded correctly. Other questions on this website cover the details:

Python: Converting from ISO-8859-1/latin1 to UTF-8

python: unicode in Windows terminal, encoding used?

What is the difference between encode/decode?
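
To make encode/decode concrete, here is a sketch of one way a "Â–" pair can appear. It rests on an assumption the thread does not confirm: that the page stores its en dash as the single Windows-1252 byte 0x96 and that the byte is read with the wrong encoding somewhere along the way. Written for Python 3.

```python
# Assumption: the page holds the en dash as the Windows-1252 byte 0x96.
page_byte = b"\x96"

# Decoded with the right codec, the byte is an en dash (U+2013).
intended = page_byte.decode("cp1252")

# Decoded as Latin-1 instead, it becomes the control character U+0096.
misread = page_byte.decode("latin-1")

# Writing that character out as UTF-8 produces two bytes: C2 96.
as_utf8 = misread.encode("utf-8")

# A viewer that reads those bytes as Windows-1252 shows two characters.
shown = as_utf8.decode("cp1252")
```

If this is what happened, decoding the response with the correct codec up front avoids the stray character entirely, which is why the page encoding matters.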

score 1 · Accepted Answer · answered Jun 27 '15 at 22:45

You already got a working solution, but for the future:

Use get_text() to get rid of the tags. select() returns a list of tags, so call get_text() on each element:

dates = [b.get_text() for b in soup.select('table.chart-wide > tr > td > nobr > font > a > b')]

Use .replace(u'\xc2', u'') to get rid of the Â. The u prefix makes u'\xc2' a unicode string. (This might take some futzing around with encoding, but for me get_text() is already returning a unicode object.)

(Additionally, consider .replace(u'\u2013', u'-'), because right now you have an en dash :P.)

dates = [d.replace(u'\xc2', u'').replace(u'\u2013', u'-') for d in dates]
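
A self-contained sketch of the clean-up step, assuming the tag text has already been extracted into plain strings. The helper name clean_date and the sample list are hypothetical. It is written for Python 3, where every str is Unicode and the u prefix is optional.

```python
def clean_date(text):
    """Drop the stray U+00C2 character and turn en dashes into hyphens."""
    return text.replace("\u00c2", "").replace("\u2013", "-")

# Hypothetical strings, shaped like what get_text() would return here.
extracted = ["Nov 22\u00c2\u201324", "Nov 29\u00c2\u2013Dec 1"]

cleaned = [clean_date(t) for t in extracted]
```

Using escapes such as "\u00c2" instead of typing the character keeps the source file pure ASCII, so no encoding declaration is needed even on Python 2 (with u'' literals).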

score 0 · Answer 4 · answered Jun 27 '15 at 23:04

If your Python 2 source code has literal non-ASCII characters such as Â then you should declare the source code encoding as the error message says. Put this on the first or second line of your Python file:

# -*- coding: utf-8 -*-

Make sure the file is saved using the utf-8 encoding and use Unicode strings to work with the text.

answered Jun 27 '15 at 23:04

jfs

If you are a Vim user rather than an Emacs one, you can instead put near the top: ``# vim:set fileencoding=utf8:``. – bufh Jun 28 '15 at 09:41
@bufh: Python doesn't care as long as it matches [`"coding[:=]\s*([-\w.]+)"` regular expression](https://www.python.org/dev/peps/pep-0263/). – jfs Jun 28 '15 at 09:51
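
A quick check of that point, using the regular expression exactly as quoted in the comment above. The helper name declared_encoding is made up for illustration, and the sketch only tests the pattern; the interpreter itself also requires the declaration to sit in a comment on the first or second line.

```python
import re

# Pattern as quoted in the comment above.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def declared_encoding(line):
    """Return the encoding name declared in a comment line, or None."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None

emacs_style = declared_encoding("# -*- coding: utf-8 -*-")
vim_style = declared_encoding("# vim:set fileencoding=utf8:")
no_declaration = declared_encoding("# just a comment")
```

Both editor styles match because "fileencoding=utf8" contains "coding=utf8", which is all the pattern looks for.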
