3

I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over XML text files (sample) that are apparently encoded in UTF-8, using Beautiful Soup to parse each file, then checking whether any sentence in the file contains one or more words from two different lists of words. Because the XML files are from the eighteenth century, I need to retain the em dashes that are in the XML. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is this character.

(You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.)

To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove.

for work in glob.glob(pathtofiles):

    openfile = open(work)
    readfile = openfile.read()
    stringfile = str(readfile)

    decodefile = stringfile.decode('utf-8', 'strict') #is this the dodgy line?
    soup = BeautifulSoup(decodefile)

    textwithtags = soup.findAll('text')

    textwithtagsasstring = str(textwithtags)

    #this method strips everything between anglebrackets as it should
    textwithouttags = stripTags(textwithtagsasstring)

    #clean text
    nonewlines = textwithouttags.replace("\n", " ")
    noextrawhitespace = re.sub(' +',' ', nonewlines)

    print noextrawhitespace #the boxes appear

I tried to remove the boxes by using

noboxes = noextrawhitespace.replace(u"\u2610", "")

But Python raised an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128)

Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.

duhaime
  • 1
    Wow, who was generating XML files in the 18th century? Leibniz? – abarnert Oct 22 '13 at 21:54
  • 1
    (Leibniz indeed, but Newton beat him to the punch.) – duhaime Oct 22 '13 at 21:55
  • 1
    Meanwhile, what is the `str(readfile)` supposed to do? The `read` method on files already returns a `str`. – abarnert Oct 22 '13 at 22:02
  • 1
    The funny thing about U+2610 is that it's supposed to be an [empty ballot box](http://www.fileformat.info/info/unicode/char/2610/index.htm), but in many fonts it's not present, meaning it gets printed as a missing-character empty box, which is pretty hard to tell apart. (There are similar problems with some of the line-drawing and other empty-box characters.) – abarnert Oct 22 '13 at 22:11

3 Answers

5

Give this a try:

noextrawhitespace.replace("\\u2610", "") 

I think you are just missing that extra '\'

This might also work.

print(noextrawhitespace.decode('unicode_escape').encode('ascii','ignore'))
jramirez
  • Many thanks, @jramirez, but I believe `.rstrip()` will only remove any trailing whitespace following the noextrawhitespace object. I believe I need instead something like `.replace(boxcharacter, "")` or a `re.sub()` method that will allow me to eliminate the boxcharacter. – duhaime Oct 22 '13 at 21:46
  • Thanks again, @jramirez. This method eliminates the boxes indeed, but it also eliminates the em-dashes, which I wish to retain. Is there a way to keep the em-dashes but eliminate the boxes? I am grateful for your suggestion. – duhaime Oct 22 '13 at 21:50
  • lol I edited the answer again. let me know if that does the trick. – jramirez Oct 22 '13 at 21:52
  • Thank you again, @jramirez. I think we're close, but perhaps I've misidentified the character because the boxes keep appearing, even with the replace method you identify. Is there a surefire way for me to determine which character is plaguing me? When I try copying and pasting the character into a search engine, it looks like an 'or' operator but the search yields no hits. Most puzzling... – duhaime Oct 22 '13 at 21:57
  • 1
    `for c in noextrawhitespace: print hex(ord(c))` – jramirez Oct 22 '13 at 21:59
  • This answer is wrong. His text does not include Python-escaped Unicode. It may contain XML-charref-escaped Unicode, but (a) that won't match `\\u2610` anyway, and (b) it's already been decoded before his code sees it. – abarnert Oct 22 '13 at 22:04
4

The problem is that you're mixing unicode and str. Whenever you do that, Python has to convert one to the other, which it does using sys.getdefaultencoding(). That is usually ASCII, which is almost never what you want.*

If the exception comes from this line:

noboxes = noextrawhitespace.replace(u"\u2610", "")

… the fix is simple… except that you have to know whether noextrawhitespace is supposed to be a unicode object or a UTF-8-encoded str object. If the former, it's this:

noboxes = noextrawhitespace.replace(u"\u2610", u"")

If the latter, it's this:

noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")

But really, you have to get all of the strings consistent in your code; mixing the two up is going to cause problems in more places than this one.
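
To make the two cases concrete, here is a minimal sketch. It assumes Python 3, where Python 2's `unicode` type is called `str` and Python 2's `str` type is called `bytes`; the sample text is made up for illustration and is not taken from the asker's files.

```python
# Hypothetical sample: "abc", U+2610 BALLOT BOX, "def", then an em dash.
text = "abc\u2610def\u2014"      # decoded text (Python 2: a unicode object)
data = text.encode("utf-8")      # UTF-8 bytes (Python 2: a str object)

# Case 1: decoded text, so the search pattern and replacement are text too.
clean_text = text.replace("\u2610", "")

# Case 2: UTF-8 bytes, so encode the search pattern the same way first.
clean_data = data.replace("\u2610".encode("utf-8"), b"")

print(repr(clean_text))
print(clean_data.decode("utf-8") == clean_text)
```

Either way the em dash survives, because only the one code point is targeted. The point is that both sides of `replace` have to be the same kind of string.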


Since I don't have your XML files to test, I wrote my own:

<xml>
    <text>abc&#9744;def</text>
</xml>

Then, I added these two lines to the bottom of your code (and a bit to the top to just open my file instead of globbing for whatever):

noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
print noboxes

The output is now:

[<text>abc☐def</text>]
[<text>abc☐def</text>]
[<text>abcdef</text>]

So, I think that's what you want here.


* Sure, sometimes you want ASCII… but those aren't usually the times when you have unicode objects…

abarnert
  • Thank you so much for this helpful response, @abarnert. I took some time with it, and had to do some outside research, and it seems that by the time the script reaches the print line, the text object has been converted back into an ascii string (because of the `textwithtagsasstring` line, which converts the text to a string so that I can run the removeNonAscii() method, which takes strings as input). The trouble is, though, I tried _all three_ of the methods you suggested, but the pesky boxes are still printing to console. What am I missing? – duhaime Oct 22 '13 at 22:35
  • @duhaime: Do you really mean "converted back into an ascii string", or "converted back into a UTF-8 string"? Because the latter you can deal with; the former, it's too late… Anyway, did you try my test code out? Does it work for you? Does your XML look like that, or does it have un-charref-escaped Unicode stored in it directly? If the latter, are you sure it's UTF-8? (What are the actual bytes in the file?) – abarnert Oct 22 '13 at 22:45
  • Ah, I used `print isinstance(noextrawhitespace, unicode)` and got "False", then used `import chardet` `print chardet.detect(noextrawhitespace)` and got "{'confidence': 0.99, 'encoding': 'utf-8'}". I then used my IDE to edit my "current file settings" and select "utf-8" as my encoding. Then I could simply use `noboxes = noextrawhitespace.replace('∣', '')` except the box looks like a box in the IDE. Then noboxes prints as expected. Is this a bootleg solution? Will it introduce unexpected problems? I'm very thankful for your comments. – duhaime Oct 22 '13 at 22:57
  • 1
    @duhaime: First, if you want to put non-ASCII literals into your code, you need to add a [coding declaration](http://www.python.org/dev/peps/pep-0263/) to tell Python the file is UTF-8, not just tell your IDE that the file is UTF-8. And really, things are a lot simpler if you just don't use Unicode literals. In some cases, the readability benefits are worth the cost, but in this case, I think it will be _less_ readable. Imagine coming to your code in 6 months and trying to figure out `'☐'`, `u'\u2610'.encode('utf-8')`, or `'\xe2\x98\x90'`; won't the first one be the hardest? – abarnert Oct 22 '13 at 23:12
1

Reading your sample, the following are the non-ASCII characters in the document:

0x2223 DIVIDES
0x2022 BULLET
0x3009 RIGHT ANGLE BRACKET
0x25aa BLACK SMALL SQUARE
0x25ca LOZENGE
0x3008 LEFT ANGLE BRACKET
0x2014 EM DASH
0x2026 HORIZONTAL ELLIPSIS

\u2223 is the actual character in question in line 3682, and it is being used as a soft hyphen. The others are used in markup for tagging illegible characters, such as:

<GAP DESC="illegible" RESP="oxf" EXTENT="4+ letters" DISP="\u2022\u2022\u2022\u2022\u2026"/>
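
An inventory like the one above can be produced with the standard `unicodedata` module. This is only a sketch, assuming Python 3 and already-decoded text; `non_ascii_inventory` and the sample string are hypothetical names and data, not part of the original script or file.

```python
import unicodedata


def non_ascii_inventory(text):
    """Map each distinct non-ASCII character in text to its Unicode name."""
    return {
        ch: unicodedata.name(ch, "UNKNOWN")
        for ch in set(text)
        if ord(ch) > 127
    }


# Hypothetical sample, not copied from the actual XML file.
sample = "Well,\u2014poor Keck\u2223ky \u2022\u2022\u2026"

for ch, name in sorted(non_ascii_inventory(sample).items()):
    print("0x%04x %s" % (ord(ch), name))
```

Printing the code point and name avoids relying on how a font or console happens to render the character, which is what made the box so hard to identify.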

Here's some code to do what your code is attempting. Make sure to process in Unicode:

from bs4 import BeautifulSoup
import re

with open('k000039.000.xml') as f:
    soup = BeautifulSoup(f)  # BS figures out the encoding

text = u''.join(soup.strings)      # strings is a generator for just the text bits.
text = re.sub(ur'\s+',ur' ',text)  # Simplify all white space.
text = text.replace(u'\u2223',u'') # Get rid of the DIVIDES character.
print text

Output:

[[truncated]] reckon my self a Bridegroom too. Buckle. I doubt Kickey won't find him such. [Aside.] Mrs. Sago. Well,—poor Keckky's bound to good Behaviour, or she had lost quite her Puddy's Favour. Shall I for this repine at Fortune?—No. I'm glad at Heart that I'm forgiven so. Some Neighbours Wives have but too lately shown, When Spouse had left 'em all their Friends were flown. Then all you Wives that wou'd avoid my Fate. Remain contented with your present State FINIS.

Mark Tolonen
  • Thank you for your feedback, @Mark Tolonen. I tried implementing your suggestions, which seem far faster than my Rube Goldberg approach, but I'm getting an error when I try to write to disk. I'm trying to `write()` a few tab separated fields followed by a `'\n'` each time a condition is met, but am getting an error message for the line that tries to write the '\n': `UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 0: ordinal not in range(128)`. Do you happen to know how I can resolve that error? I would be grateful for any insight you can offer. – duhaime Oct 23 '13 at 21:58
  • Use the `codecs.open` function to open the file and specify an encoding. That's the correct way to write Unicode to a file. – Mark Tolonen Oct 24 '13 at 01:52
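
For illustration, writing such a line with an explicit encoding might look like the sketch below. It assumes Python 3 and swaps in `io.open`, which accepts the same `encoding` argument as `codecs.open`; the field values and the output path are made up.

```python
import io
import os
import tempfile

# Hypothetical row of tab-separated fields containing an em dash.
fields = ["k000039.000.xml", "Well,\u2014poor Keckky"]
line = "\t".join(fields) + "\n"

# Made-up output location in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "matches.tsv")

# Passing encoding makes the file object encode text as UTF-8 on write,
# rather than relying on an implicit default encoding.
with io.open(path, "w", encoding="utf-8") as out:
    out.write(line)

# Read the file back with the same encoding to confirm the round trip.
with io.open(path, "r", encoding="utf-8") as check:
    round_trip = check.read()

print(round_trip == line)
```

The file object does the encoding, so the rest of the script can keep working with decoded text throughout.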