value exceed number of set rule

Question

I have a CSV file with a column named 'Body' that contains a mix of normal characters and Unicode characters, and I am trying to figure out how to tell them apart. For normal characters, I've been able to write the code below:

df.loc[(df['UDH'].isnull()) & df['Body'].str.len().gt(156), 'Double'] = '1'
df.loc[(df['UDH'].notnull()) & (df['Body'].str.len().gt(153)), 'Double'] = '1'

Above is my current code. It filters on multiple columns and, if 'Body' exceeds the allowed number of characters, sets the 'Double' column to '1'. This works for normal characters.

When I tried it on rows containing Unicode characters, it didn't work. My code for Unicode is below:

df.loc[(df['UDH'].isnull()) & df['Body'].str.len().gt(66), 'Double'] = '1'
df.loc[(df['UDH'].notnull()) & (df['DCS']=='0') & (df['Body'].str.len().gt(63)), 'Double'] = '1'

Here are some examples of the Unicode text. The data also contains other foreign languages such as Mandarin, Tamil, Punjabi and Bulgarian:

       Body
è¯·å‹¿å°†æ‚¨çš„å–æ¬¾ä»£ç 1737958ç»™ä»–äºº
ਹੈਲੋ ਤੁਹਾਨੂੰ ਮਿਲ ਕੇ ਚੰਗਾ ਲੱਗਿਆ
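One possible way to apply both sets of limits in a single pass is sketched below. It assumes that a "Unicode" body is any body containing at least one non-ASCII character, which may not match how the messages are actually classified. The sample DataFrame is hypothetical, and the extra `DCS` condition from my Unicode attempt is left out for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "UDH": [None, "050003", None, "050003"],
    "Body": ["a" * 157, "a" * 150, "请" * 70, "请" * 60],
})

# Assumption: a body is "Unicode" if any character is outside ASCII.
is_unicode = df["Body"].map(lambda s: any(ord(ch) > 127 for ch in s))
has_udh = df["UDH"].notnull()
length = df["Body"].str.len()  # counts characters (code points), not bytes

# Limits from the question: 156/153 for normal text, 66/63 for Unicode text.
limit = np.where(
    is_unicode,
    np.where(has_udh, 63, 66),
    np.where(has_udh, 153, 156),
)

# Flag every row whose body is longer than its own limit.
df.loc[length.gt(limit), "Double"] = "1"
print(df[["UDH", "Double"]])
```

Rows that stay within their limit keep `NaN` in 'Double', the same as with the two separate `df.loc` statements.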

I'd appreciate any suggestions on this. Thank you in advance :)

EDIT:

For the Unicode character type, this line:

df.loc[(df['UDH'].notnull()) & (df['DCS']=='0') & (df['Body'].astypes('UTF8').len().gt(66)), 'Double'] = '1'

gave me the error below:

Traceback (most recent call last):
  File "/Users/syafiq/opt/anaconda3/lib/python3.7/tkinter/__init__.py", line 1705, in __call__
    return self.func(*args)
  File "/Users/syafiq/Downloads/RoutingPractice01.py", line 47, in main
    df.loc[(df['UDH'].notnull()) & (df['DCS']=='0') & (df['Body'].astypes('UTF8').len().gt(66)), 'Double'] = '1'
  File "/Users/syafiq/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5179, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'astypes'
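The AttributeError above comes from the method name: a pandas Series has `astype`, not `astypes`, and string lengths go through the `.str` accessor rather than a `.len()` method on the Series itself. A minimal sketch of the length calls that do exist is below. The sample values are hypothetical, and whether the 66/63 limits are meant in characters, UTF-8 bytes or UTF-16 code units is an assumption that would need checking against the actual rule.

```python
import pandas as pd

# Hypothetical sample values, not taken from the real file.
body = pd.Series(["hello", "请" * 3])

# Number of characters (Unicode code points) per row.
char_len = body.astype(str).str.len()

# Number of bytes per row after encoding to UTF-8.
utf8_bytes = body.str.encode("utf-8").str.len()

# Number of UTF-16 code units per row (2 bytes each); only relevant
# if the limit is defined that way, which is an assumption here.
utf16_units = body.str.encode("utf-16-le").str.len() // 2

print(char_len.tolist(), utf8_bytes.tolist(), utf16_units.tolist())
```

`.str.len()` already counts Unicode characters correctly as long as the column was decoded properly when the CSV was read, so no cast to 'UTF8' is needed for a character count.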

It seems the encoding is broken in the first sample line. What is the expected length of both rows? — jezrael, Mar 08 '20 at 08:53
For Unicode characters the limit depends on the 'UDH' column: if it is not empty, the maximum is 63 characters, whereas if 'UDH' is null the maximum is 66. — hula-hula, Mar 08 '20 at 08:55
Yes, but I think it is hard to find the length if the encoding is not consistent; most of the time it is `UTF8`. — jezrael, Mar 08 '20 at 08:58
I've used an online tool to detect it, and it shows as Unicode. So would it be df['Body'].datatype('UTF8').len.gt(66)? — hula-hula, Mar 08 '20 at 09:05
I've tried df['Body'].dtypes('UTF8') and it gave me the error TypeError: 'numpy.dtype' object is not callable. It's weird, as I didn't use numpy on this line. I'll update my post. — hula-hula, Mar 08 '20 at 09:13
It gave me the error AttributeError: 'Series' object has no attribute 'astypes'. — hula-hula, Mar 08 '20 at 09:16
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/209245/discussion-between-syafiq-rosli-and-jezrael). — hula-hula, Mar 08 '20 at 10:04
