
I have a dataframe with a few columns in Japanese. I want to pad those column values to match the expected column length.

Dataframe:

from io import StringIO

import pandas as pd

# Sample data: both columns may contain Japanese text
StringData = StringIO(
    """agency_code,Name
亜草 太郎32,パンダーサン
亜草 太郎3223,2
"""
)

df_orig_data = pd.read_csv(StringData, sep=",")

The expected length is 15 for both columns. When I run this:

print(
    df_orig_data["agency_code"]
    .astype(str)
    .str.pad(width=15, side="right", fillchar="0")
)

I get:

0    亜草 太郎3200000000
1    亜草 太郎3223000000

It counts each double-byte character as a single character when padding with zeroes.

What I actually need is:

0    亜草 太郎320000
1    亜草 太郎322300


亜草 太郎32 - 11 chars (4 double-byte and 3 single-byte) + 4 zeroes = 亜草 太郎320000
亜草 太郎3223 - 13 chars (4 double-byte and 5 single-byte) + 2 zeroes = 亜草 太郎322300
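
The counts above can be checked with a short sketch. It assumes that every character whose Unicode East Asian Width is `W` (wide) or `F` (fullwidth) counts as 2 and everything else as 1; `display_width` is a hypothetical helper name, not a pandas or standard-library function, and combining marks or ambiguous-width characters are not handled.

```python
import unicodedata


def display_width(text: str) -> int:
    """Count wide/fullwidth characters as 2 and all other characters as 1."""
    # Assumption: East Asian Width "W" or "F" means the character is "double".
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


for value in ("亜草 太郎32", "亜草 太郎3223"):
    # len() counts code points; display_width() counts the doubled widths.
    print(value, len(value), display_width(value))
```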

Issue:

I am not sure how to handle these Japanese characters alongside ordinary letters and digits when padding the values.

  • Like most padding functions, this is counting the number of characters, not the expected display width. Generally speaking, you could expect CJK characters to have a display width equal to that of two non-CJK characters, but in more nitty-gritty detail, the problem is more complex and nuanced than that. See also e.g. https://stackoverflow.com/questions/30881811/how-do-you-get-the-display-width-of-combined-unicode-characters-in-python-3 and https://stackoverflow.com/questions/22225441/display-width-of-unicode-strings-in-python – tripleee Nov 16 '22 at 10:32
  • https://pypi.org/project/wcwidth/ attempts to solve the problem, but I can't comment on its accuracy. – tripleee Nov 16 '22 at 10:34
  • `print(len('亜草 太郎32'),len('亜草 太郎3223'))` returns `7 9` (not your wrongly assumed `11 13`)! – JosefZ Nov 16 '22 at 12:48
  • @tripleee Thank you so much for the information. I will try something with encoding and decoding the Japanese chars to see if that helps. – Curious_Insider Nov 22 '22 at 07:29
  • @JosefZ Yes, printing those gives 7 and 9 respectively. I want it to treat those characters as double-byte and count the lengths as 11 and 13, so that the padding function adds fewer characters. – Curious_Insider Nov 22 '22 at 07:32
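
Following the display-width point raised in the comments, a minimal sketch of width-aware right padding might look like the following. It assumes that characters with Unicode East Asian Width `W` or `F` occupy two columns and all others one; `display_width` and `pad_to_width` are hypothetical helper names introduced here for illustration. Combining marks, emoji sequences, and ambiguous-width characters are not handled, so the wcwidth package mentioned above may be a better fit for general text.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate width: 2 for wide/fullwidth characters, 1 otherwise."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def pad_to_width(text: str, width: int = 15, fillchar: str = "0") -> str:
    """Right-pad text with fillchar until its approximate width reaches width."""
    missing = width - display_width(text)
    # Values that are already too wide are returned unchanged, not truncated.
    return text + fillchar * max(missing, 0)


for value in ("亜草 太郎32", "亜草 太郎3223"):
    print(pad_to_width(value))
```

On the DataFrame from the question, the helper could be applied with `df_orig_data["agency_code"].astype(str).map(pad_to_width)`, in place of `.str.pad(...)`.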

0 Answers