
I am reading a text file and processing it line by line.

I came across this line:

(T)he film is never sure to make a clear point – even if it seeks to rely on an ambiguous presentation.

Between "point" and "even" there is a dash (–) that appears to be made up of three characters.

I tried printing out the characters as integers.

In Java:

String input = "(T)he film is never sure to make a clear point – even if it seeks to rely on an ambiguous presentation.";
int[] ords = new int[input.length()];
for (int i = 0; i < ords.length; i++)
    ords[i] = (int) input.charAt(i);

which gives:

[40, 84, 41, 104, 101, 32, 102, 105, 108, 109, 32, 105, 115, 32, 110, 101, 118, 101, 114, 32, 115, 117, 114, 101, 32, 116, 111, 32, 109, 97, 107, 101, 32, 97, 32, 99, 108, 101, 97, 114, 32, 112, 111, 105, 110, 116, 32, 8211, 32, 101, 118, 101, 110, 32, 105, 102, 32, 105, 116, 32, 115, 101, 101, 107, 115, 32, 116, 111, 32, 114, 101, 108, 121, 32, 111, 110, 32, 97, 110, 32, 97, 109, 98, 105, 103, 117, 111, 117, 115, 32, 112, 114, 101, 115, 101, 110, 116, 97, 116, 105, 111, 110, 46]

In Python:

def get_ords(string):
    return map(lambda x: ord(x), string)

which gives:

[40, 84, 41, 104, 101, 32, 102, 105, 108, 109, 32, 105, 115, 32, 110, 101, 118, 101, 114, 32, 115, 117, 114, 101, 32, 116, 111, 32, 109, 97, 107, 101, 32, 97, 32, 99, 108, 101, 97, 114, 32, 112, 111, 105, 110, 116, 32, 226, 128, 147, 32, 101, 118, 101, 110, 32, 105, 102, 32, 105, 116, 32, 115, 101, 101, 107, 115, 32, 116, 111, 32, 114, 101, 108, 121, 32, 111, 110, 32, 97, 110, 32, 97, 109, 98, 105, 103, 117, 111, 117, 115, 32, 112, 114, 101, 115, 101, 110, 116, 97, 116, 105, 111, 110, 46]

In Java's result, the dash between "point" and "even" is represented by the single value 8211, whereas in Python it is represented by three values, 226, 128, 147, i.e. '\xe2', '\x80', '\x93'. This discrepancy leads to different results when I process the text in Java and in Python.

I also noticed that if I remove the dash from the string, the results are the same for both.

Is it possible to solve this issue without having to remove the special characters?

Animesh Pandey

3 Answers


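You're probably not using it as a Unicode string in Python (u prefix in Python 2). A plain Python 2 str holds bytes, so you are seeing the three UTF-8 bytes that encode the en dash (U+2013, decimal 8211) rather than the single character that Java reports.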

This can be illustrated by the following code (using the relevant part of your example):

# -*- coding: utf-8 -*-

x = u"t – e"
y = "t – e"

def get_ords(s):
    return map(lambda x: ord(x), s)

print "x: %s" % (get_ords(x),)
print "y: %s" % (get_ords(y),)

The result is:

x: [116, 32, 8211, 32, 101]
y: [116, 32, 226, 128, 147, 32, 101]

This Python documentation about Unicode should be of interest: https://docs.python.org/2/howto/unicode.html
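The relationship between the two results can be checked directly. The following is a minimal sketch in Python 3 syntax (an assumption; the code above is Python 2), where str is always Unicode and bytes is a separate type. The variable names are only illustrative.

```python
# Python 3 sketch: one character versus its UTF-8 bytes.
dash = "\u2013"  # EN DASH, the character between "point" and "even"

code_point = ord(dash)                   # a single Unicode code point
utf8_bytes = list(dash.encode("utf-8"))  # the bytes stored in a UTF-8 file

print(code_point)   # 8211
print(utf8_bytes)   # [226, 128, 147]

# Decoding the three bytes gives back the single character.
assert bytes(utf8_bytes).decode("utf-8") == dash
```

So neither language is wrong: Java is reporting the character, and the Python 2 byte string is reporting its encoded form.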

When reading from a file, you can use codecs. Otherwise, you're not reading the file as Unicode:

import codecs

with codecs.open('test.txt','r','utf-8') as f:
    x = f.read()

with open('test.txt','r') as f:
    y = f.read()

(This produces the same results as above.)
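For comparison, here is a rough Python 3 equivalent, assuming the file is UTF-8 encoded. In Python 3 the built-in open accepts an encoding argument, and binary mode plays the role of the undecoded Python 2 read. The temporary file is created only so that the sketch is self-contained.

```python
import os
import tempfile

# Write a small UTF-8 file to read back (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("t \u2013 e")

# Text mode with an explicit encoding: values are code points.
with open(path, "r", encoding="utf-8") as f:
    x = [ord(c) for c in f.read()]

# Binary mode: values are the raw bytes, as in the Python 2 str case.
with open(path, "rb") as f:
    y = list(f.read())

print(x)  # [116, 32, 8211, 32, 101]
print(y)  # [116, 32, 226, 128, 147, 32, 101]
```

Passing the encoding explicitly avoids depending on the platform default, which is the same concern as the Java note that follows.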

Note that, in Java, the encoding used for reading may also depend on the value of the file.encoding system property. (It depends on how you read the file, see: https://docs.oracle.com/javase/tutorial/i18n/text/stream.html )

Bruno

I would make sure the string has the same encoding in both languages. For example, in Python I'd do something like the following to get it into UTF-8:

def get_ords(string):
    string = string.encode('utf-8')
    return map(lambda x: ord(x), string)
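
The snippet above normalizes towards bytes. If the goal is instead to match the values Java prints, an alternative is to normalize towards code points. The helper below is a hypothetical sketch in Python 3 syntax and assumes any byte input is UTF-8. Note that Java's charAt returns UTF-16 code units, which equal the code points only for characters in the Basic Multilingual Plane, such as the en dash here.

```python
def to_code_points(value, encoding="utf-8"):
    """Return Unicode code points for bytes or str input (hypothetical helper)."""
    if isinstance(value, bytes):
        value = value.decode(encoding)  # assumes the bytes really are UTF-8
    return [ord(ch) for ch in value]

raw = b"t \xe2\x80\x93 e"   # bytes as read from a UTF-8 file
text = "t \u2013 e"         # the same content as a decoded string

print(to_code_points(raw))   # [116, 32, 8211, 32, 101]
print(to_code_points(text))  # [116, 32, 8211, 32, 101]
```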
David542

While the answer given by @Bruno is very good, I was able to solve my issue using the following function:

from unidecode import unidecode

def remove_non_ascii(text):
    return unidecode(unicode(text, encoding="utf-8"))

I applied remove_non_ascii to every string in Python and did the equivalent in Java, so both sides work on the same ASCII text.
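
unidecode is a third-party package. Where installing it is not an option, a much cruder stand-in can be written with the standard library alone. The sketch below uses Python 3 syntax; it is not how unidecode works and covers far fewer characters. The PUNCTUATION table and the to_ascii name are assumptions made up for this example.

```python
import unicodedata

# Hypothetical replacement table; extend it for the punctuation you care about.
PUNCTUATION = {0x2013: "-", 0x2014: "-", 0x2018: "'", 0x2019: "'"}

def to_ascii(text):
    """Rough ASCII approximation of text; drops what it cannot map."""
    text = text.translate(PUNCTUATION)
    text = unicodedata.normalize("NFKD", text)  # split accents from letters
    return text.encode("ascii", "ignore").decode("ascii")

print(to_ascii("point \u2013 even"))  # point - even
print(to_ascii("caf\u00e9"))          # cafe
```

Either way, the conversion is lossy, so it only suits cases where the non-ASCII characters carry no meaning for the processing.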

Animesh Pandey