356

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode characters representing spaces. Is there an efficient way to remove all of them in Python 2.7 and change them into spaces? I guess the more generalized question would be: is there a way to remove Unicode formatting?

I tried using: line = line.replace(u'\xa0',' '), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):

EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?

ivanleoncz
zhuyxn
  • tried that already, 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128) – zhuyxn Jun 12 '12 at 09:19
  • 19
    embrace Unicode. Use `u''`s instead of `''`s. :-) – jpaugh Jun 12 '12 at 09:26
  • 2
    tried using str.replace(u'\xa0', ' ') but got "u"s everywhere instead of \xa0s :/ – zhuyxn Jun 12 '12 at 09:30
  • If the string is a unicode one, you have to use the `u' '` replacement, not the `' '`. Is the original string a unicode one? – pepr Jun 12 '12 at 10:51

16 Answers

410

\xa0 is actually a non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.

string = string.replace(u'\xa0', u' ')

When you call .encode('utf-8'), it encodes the Unicode string to UTF-8, which means every Unicode character is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0.
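
A minimal Python 2.7 sketch of the difference (the string value here is made up for illustration):

s = u'Dear Parent,\xa0Thanks'
print repr(s.replace(u'\xa0', u' '))  # u'Dear Parent, Thanks' -- replacing fixes it
print repr(s.encode('utf-8'))         # 'Dear Parent,\xc2\xa0Thanks' -- encoding just re-spells it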

Read up on http://docs.python.org/howto/unicode.html.

Please note: this answer is from 2012; Python has moved on, and you should be able to use unicodedata.normalize now.

TFD
samwize
  • 16
    I don't know a huge amount about Unicode and character encodings.. but it seems like [unicodedata.normalize](http://docs.python.org/2/library/unicodedata.html#unicodedata.normalize) would be more appropriate than str.replace – dbr Sep 09 '13 at 07:45
  • Yours is workable advice for strings, but note that all references to this string will also need to be replaced. For example, if you have a program that opens files, and one of the files has a non-breaking space in its name, you will need to *rename* that file in addition to doing this replacement. –  Sep 23 '14 at 10:52
  • 4
    [U+00a0 is a non-breakable space Unicode character](http://codepoints.net/U+00a0) that can be encoded as `b'\xa0'` byte in latin1 encoding, as two bytes `b'\xc2\xa0'` in utf-8 encoding. It can be represented as `&nbsp;` in html. – jfs Jan 20 '15 at 12:39
  • 4
    When I try this, I get `UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 397: ordinal not in range(128)`. – jds May 28 '15 at 22:15
  • I tried this code on a list of strings and it didn't do anything; the \xa0 character remained. If I re-encoded my text file to UTF-8, the character would appear as an uppercase A with a caret on its head, and when I encoded it in Unicode the Python interpreter crashed. – Mushroom Man Jul 20 '16 at 22:02
  • @dbr `unicodedata` does **not** replace `\xa0` with `NFC` (which properly retains letters with accent such as `é`). Example: `unicodedata.normalize("NFC", "LEFT\xa0RIGHT") == "LEFT\xa0RIGHT"`. – Jean Monet Apr 01 '22 at 08:21
325

There are many useful things in Python's unicodedata library. One of them is the .normalize() function.

Try:

import unicodedata

new_str = unicodedata.normalize("NFKD", unicode_str)

Replace NFKD with any of the other normalization forms listed in the unicodedata docs if you don't get the results you're after.
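
For example (a hypothetical string; NFKD folds the no-break space, U+00A0, into a plain space):

s = u'Dear Parent,\xa0Thanks'
print unicodedata.normalize('NFKD', s)  # Dear Parent, Thanks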

Jamie
  • This did the trick. Had some HTML generated by... Microsoft Word with lots of weird unicode characters and this somehow cleaned them all. – José Tomás Tocino Jun 04 '17 at 19:06
  • 4
    Not so sure, you may want `normalize('NFKD', '1º\xa0dia')` to return '1º dia' but it returns '1o dia' – Faccion Nov 08 '17 at 14:58
  • 5
    here is the [docs about `unicodedata.normalize`](https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize) – TT-- Dec 04 '17 at 15:04
  • 3
    ah, if the text is Korean, do not try this. All the characters end up completely corrupted. – Cho Oct 17 '19 at 09:05
  • 3
    This solution changes Russian letter `й` to an identically looking sequence of two unicode characters. The problem here is that strings that used to be equal do not match anymore. Fix: use `"NFKC"` instead of `"NFKD"`. – Markus Apr 21 '20 at 19:23
  • It doesn't catch the 'soft hyphen' (-), which is '\xad' in Latin1. Is there any trick to also catch this symbol? – the_economist Jul 10 '20 at 11:06
  • @Markus: The same applies to the German Umlaute ö, ü and ä. 'NFKC' is required instead of 'NFKD'. – the_economist Jul 10 '20 at 11:27
  • 3
    This is awesome. It changes the one-letter string `﷼` to the four-letter string `ریال` that it actually is. So it's much easier to replace when needed. You'd normalize and then replace, without having to care which one it was. `normalize("NFKD", "﷼").replace("ریال", '')`. – Amir Shabani Apr 29 '21 at 07:55
39

After trying several methods, to summarize, this is how I did it. Following are two ways of avoiding/removing \xa0 characters from a parsed HTML string.

Assume we have our raw HTML as follows:

raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'

So let's try to clean this HTML string:

from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print text_string
#u'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'

The above code leaves \xa0 characters in the string. To remove them properly, we can use one of two methods.

Method #1 (recommended): The first one is BeautifulSoup's get_text method with the strip argument set to True, so our code becomes:

clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print clean_text
# Dear Parent,This is a test message,kindly ignore it.Thanks

Method #2: The other option is to use Python's unicodedata library, specifically unicodedata.normalize:

import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD",text_string)
print clean_text
# u'Dear Parent,This is a test message,kindly ignore it.Thanks'

I have also detailed these methods on this blog, which you may want to refer to.

Nate Anderson
Ali Raza Bhayani
  • 4
    get_text(strip=True) really did the trick. Thanks m8 – ChewChew Nov 24 '21 at 18:57
  • this is very specific for raw html returning unicode after cleaning with bs4 or regex. Works perfectly, but it will not remove line breaks or tabs – Y4RD13 May 09 '22 at 12:18
30

Try using .strip() at the end of your line; line.strip() worked well for me.

user3590113
20

try this:

string.replace('\\xa0', ' ')
user278064
  • 6
    @RyanMartin: this replaces **four bytes**: `len(b'\\xa0') == 4` but `len(b'\xa0') == 1`. If possible; you should fix upstream that generates these escapes. – jfs Jan 20 '15 at 12:43
  • 4
    This solution worked for me: `string.replace('\xa0', ' ')` – Jenya Pu Jul 04 '20 at 14:31
17

Python recognizes it as a space character (for unicode strings), so you can split without arguments and join with a single normal space:

line = ' '.join(line.split())
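
A quick illustrative check (this assumes line is a unicode string; note it also collapses runs of ordinary spaces):

line = u'Dear\xa0Parent,   Thanks'
print ' '.join(line.split())  # Dear Parent, Thanks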
Max
15

I ran into this same problem pulling some data from a sqlite3 database with Python. The above answers didn't work for me (not sure why), but this did: line = line.decode('ascii', 'ignore'). However, my goal was deleting the \xa0s, rather than replacing them with spaces.

I got this from this super-helpful unicode tutorial by Ned Batchelder.
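
A Python 2 sketch of what this does (made-up bytestring; note the \xa0 byte is silently dropped, which the comments below warn about):

line = 'Dear Parent,\xa0Thanks'
print repr(line.decode('ascii', 'ignore'))  # u'Dear Parent,Thanks'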

Community
  • 15
    You are now removing anything that isn't an ASCII character; you are probably masking your actual problem. Using `'ignore'` is like shoving through the shift stick even though you don't understand how the clutch works. – Martijn Pieters Dec 11 '12 at 20:58
  • @MartijnPieters The linked unicode tutorial is good, but you are completely correct - `str.encode(..., 'ignore')` is the Unicode-handling equivalent of `try: ... except: ...`. While it might hide the error message, it rarely solves the problem. – dbr Sep 09 '13 at 07:43
  • 2
    for some purposes like dealing with EMAIL or URLS it seems perfect to use `.decode('ascii', 'ignore')` – andilabs Dec 12 '14 at 10:15
  • 2
    [samwize's answer](http://stackoverflow.com/a/11566398/4279) didn't work for you because it works on **Unicode** strings. `line.decode()` in your answer suggests that your input is a **bytestring** (you should not call `.decode()` on a Unicode string (to enforce it, the method is removed in Python 3). I don't understand how it is possible to see [the tutorial that you've linked in your answer](http://nedbatchelder.com/text/unipain.html) and miss the difference between bytes and Unicode (do not mix them). – jfs Jan 20 '15 at 12:49
13

Try this code

import re
re.sub(r'[^\x00-\x7F]+','','paste your string here').decode('utf-8','ignore').strip()
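
An illustrative Python 2 run on a made-up UTF-8 bytestring (the two bytes \xc2\xa0 both fall outside the ASCII range, so the whole sequence is removed):

import re

raw = 'Dear Parent,\xc2\xa0Thanks'
print re.sub(r'[^\x00-\x7F]+', '', raw)  # Dear Parent,Thanks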
shiva
9

I ended up here while googling for the problem with a non-printable character. I use MySQL UTF-8 general_ci and deal with the Polish language. For problematic strings, I had to proceed as follows:

text=text.replace('\xc2\xa0', ' ')

It is just a fast workaround; you should probably try something with the right encoding setup.
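
A sketch of that decode-early setup (assuming the bytes really are UTF-8):

raw = 'Dear Parent,\xc2\xa0Thanks'   # UTF-8 bytestring, e.g. from MySQL
text = raw.decode('utf-8')           # work on unicode from here on
text = text.replace(u'\xa0', u' ')   # replace the real character
out = text.encode('utf-8')           # back to bytes only at the boundary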

andilabs
  • 2
    this works if `text` is a bytestring that represents a text encoded using utf-8. If you are working with text; decode it to Unicode first (`.decode('utf-8')`) and encode it to a bytestring only at the very end (if API does not support Unicode directly e.g., `socket`). All intermediate operations on the text should be performed on Unicode. – jfs Jan 20 '15 at 12:57
7

In Beautiful Soup, you can pass get_text() the strip parameter, which strips white space from the beginning and end of the text. This will remove \xa0 or any other white space if it occurs at the start or end of the string. Beautiful Soup had replaced an empty string with \xa0, and this solved the problem for me.

mytext = soup.get_text(strip=True)
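
A minimal illustration with made-up markup (strip=True trims each bit of text, so a leading or trailing \xa0 disappears):

from bs4 import BeautifulSoup

soup = BeautifulSoup(u'<p>Dear Parent,\xa0</p><p>Thanks</p>', 'html.parser')
print soup.get_text(strip=True)  # Dear Parent,Thanks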
shauryachats
Mark
  • 11
    `strip=True` works only if `&nbsp;` is at the beginning or end of each bit of text. It won't remove the space if it is in between other characters in the text. – jfs Jan 20 '15 at 13:01
7

In Python, \xa0 is a character escape sequence that represents a non-breaking space.

A non-breaking space is a space character that prevents line breaks and word wrapping between two words separated by it.

You can get rid of them by running replace on a string which contains them:

my_string.replace('\xa0', '') # no more xa0
8bitjunkie
4

0xA0 (Unicode) is 0xC2A0 in UTF-8. .encode('utf8') will just take your Unicode 0xA0 and replace it with UTF-8's 0xC2A0. Hence the appearance of the 0xC2s... Encoding is not replacing, as you've probably realized by now.

dda
3

You can try string.strip().
It worked for me! :)

STA
1

Generic version with a regular expression (it will remove all the control characters):

import re

def remove_control_chars(s):
    # match actual control characters (the C0 and C1 ranges) plus the
    # no-break space U+00A0, rather than the literal text '\x..'
    return re.sub(u'[\x00-\x1f\x7f-\xa0]', u'', s)
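
Illustrative usage (hypothetical input; assumes s is a unicode string, and both the stray control byte and the no-break space disappear):

print remove_control_chars(u'Dear\x01 Parent,\xa0Thanks')  # Dear Parent,Thanks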
ranaFire
1

This is how I solved the issue, as I encountered \xa0 in an HTML-encoded string.

I discovered that a non-breaking space is inserted to ensure that a word and the subsequent HTML markup are not separated when a page is resized.

This presents a problem for the parsing code, as it introduces codec encoding issues. What made it hard was that we were not privy to the encoding used. From Windows machines it can be latin-1 or CP1252 (Western ISO), but more recent OSes have standardized on UTF-8. By normalizing the Unicode data, we strip \xa0:

import unicodedata

my_string = unicodedata.normalize('NFKD', my_string).encode('ASCII', 'ignore')
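
Continuing the snippet above, one caveat worth knowing: after NFKD, encoding to ASCII with 'ignore' also strips accents from letters, not just the \xa0 (example values are made up):

s = u'caf\xe9\xa0menu'
print unicodedata.normalize('NFKD', s).encode('ASCII', 'ignore')  # cafe menu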
Amro Younes
0

I was facing the same issue; this got it done, and it went well.

df = df.replace(u'\xa0', u'', regex=True)

All instances of \xa0 get replaced.
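
A self-contained sketch of the same idea (made-up DataFrame; assumes pandas is installed):

import pandas as pd

df = pd.DataFrame({'msg': [u'Dear\xa0Parent', u'Thanks\xa0a lot']})
df = df.replace(u'\xa0', u'', regex=True)
print df  # the \xa0 is gone from both cells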

petezurich
Chirag