trouble reading Japanese characters with full-width characters in PHP

Question

I have a PHP program that reads a file from an inventory scanner. The data arrives as one continuous line, like this:

3701804901070125616シャルダン　ステキプラスクルマ専用　ジャスミンマリアｼｬﾙﾀﾞﾝ ｽﾃｷﾌﾟﾗｽｸﾙﾏ ｼﾞ2131970080                 16033001383701804902720123549森永ＩＱサポート　もも＆りんご　１２５ｍｌ×３　    ﾓﾘﾅｶﾞIQｻﾎﾟｰﾄ ﾓﾓ&ﾘﾝｺﾞ1901030080                 16033001383701804987072042557噛むブレスケア　マスカット　２５粒　                ﾌﾞﾚｽｹｱｶﾑ 25T ﾏｽｶｯﾄ  2121070080                 1603300138

The records can be separated every 128 bytes, like this:

Inventory Scanner

Here is the hex of the string with the width of 256 bytes

0 : 33 37 30 31 38 30 34 39 30 31 30 37 30 31 32 35 36 31 36 e3 82 b7 e3 83 a3 e3 83 ab e3 83 80 e3 83 b3 e3 80 80 e3 82 b9 e3 83 86 e3 82 ad e3 83 97 e3 83 a9 e3 82 b9 e3 82 af e3 83 ab e3 83 9e e5 b0 82 e7 94 a8 e3 80 80 e3 82 b8 e3 83 a3 e3 82 b9 e3 83 9f e3 83 b3 e3 83 9e e3 83 aa e3 82 a2 ef bd bc ef bd ac ef be 99 ef be 80 ef be 9e ef be 9d 20 ef bd bd ef be 83 ef bd b7 ef be 8c ef be 9f ef be 97 ef bd bd ef bd b8 ef be 99 ef be 8f 20 ef bd bc ef be 9e 32 31 33 31 39 37 30 30 38 30 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 31 36 30 33 33 30 30 31 33 38 33 37 30 31 38 30 34 39 30 32 37 32 30 31 32 33 35 34 39 e6 a3 ae e6 b0 b8 ef bc a9 ef bc b1 e3 82 b5 e3 83 9d e3 83 bc e3 83 88 e3 80 80 e3 82 82 e3 82 82 ef bc 86 e3 82 8a e3 82 93 e3 81 94 e3 80 [3701804901070125616................................................................................................ .............................. ......2131970080 16033001383701804902720123549...............................................] 100 : 80 ef bc 91 ef bc 92 ef bc 95 ef bd 8d ef bd 8c c3 97 ef bc 93 e3 80 80 20 20 20 20 ef be 93 ef be 98 ef be 85 ef bd b6 ef be 9e 49 51 ef bd bb ef be 8e ef be 9f ef bd b0 ef be 84 20 ef be 93 ef be 93 26 ef be 98 ef be 9d ef bd ba ef be 9e 31 39 30 31 30 33 30 30 38 30 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 31 36 30 33 33 30 30 31 33 38 33 37 30 31 38 30 34 39 38 37 30 37 32 30 34 32 35 35 37 e5 99 9b e3 82 80 e3 83 96 e3 83 ac e3 82 b9 e3 82 b1 e3 82 a2 e3 80 80 e3 83 9e e3 82 b9 e3 82 ab e3 83 83 e3 83 88 e3 80 80 ef bc 92 ef bc 95 e7 b2 92 e3 80 80 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 ef be 8c ef be 9e ef be 9a ef bd bd ef bd b9 ef bd b1 ef bd b6 ef be 91 20 32 35 54 20 ef be 8f ef bd bd ef bd b6 ef bd af ef be 84 20 20 32 31 32 31 [........................ ...............IQ............... ......&............1901030080 16033001383701804987072042557...................................................... ........................ 
25T ............... 2121] 200 : 30 37 30 30 38 30 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 31 36 30 33 33 30 30 31 33 38 [070080 1603300138]

Here is my PHP code that reads the file:

    $fp = fopen($vanReadfile, "r");
    
    flock($fp, LOCK_SH);    
    flock($fp, LOCK_UN);
    $vandataBuf= fgets($fp); // fgets, since the file is a single continuous line
    $convertBufstring = mb_convert_encoding($vandataBuf, "UTF-8","Shift-JIS"); 

                $tenpo_cd       =mb_substr($convertBufstring,$i+0,3,"UTF-8");
                $chiku_cd       =mb_substr($convertBufstring,$i+3,2,"UTF-8");
                $shori_kbn      =mb_substr($convertBufstring,$i+5,1,"UTF-8");
                $jan_cd         =mb_substr($convertBufstring,$i+6,13,"UTF-8");

                $prod_nm        =mb_substr($convertBufstring,$i+19,52,"UTF-8");   
                $prod_kn        =mb_substr($convertBufstring,$i+71,20,"UTF-8");

                $jicfs_class_cd =mb_substr($convertBufstring,$i+91,6,"UTF-8");
                $prod_tax       =mb_substr($convertBufstring,$i+97,3,"UTF-8");
                $regi_duty_kbn  =mb_substr($convertBufstring,$i+100,1,"UTF-8");
                $auto_order_kbn =mb_substr($convertBufstring,$i+101,1,"UTF-8");
                $spacex16       =mb_substr($convertBufstring,$i+102,16,"UTF-8");

The output of the file looks like this using the code above.

TENPO: 370

JAN CD: 4901070125616

PROD NAME: シャルダン　ステキプラスクルマ専用　ジャスミンマリアｼｬﾙﾀﾞﾝ ｽﾃｷﾌﾟﾗｽｸﾙﾏ ｼﾞ213197

PROD NAME KN: 0080

JICFS CLASS CD: 16033

REGI DUTY KBN: 3

AUTO ORDER KBN: 8

SPACE 3701804902720123

The output I want should look something like this:

TENPO: 370

JAN CD: 4901070125616

**PROD NAME**: シャルダン　ステキプラスクルマ専用　ジャスミンマリア

**PROD NAME KN**: ｼｬﾙﾀﾞﾝ ｽﾃｷﾌﾟﾗｽｸﾙﾏ ｼﾞ

JICFS CLASS CD: 213197

REGI DUTY KBN: 3

AUTO ORDER KBN: 8

SPACE (16 white spaces here)

All the other substr() calls are correct. The problem occurs when reading the product name (full-width characters): the program sometimes reads too few characters and sometimes too many.

Here is an example:

シャルダン　ステキプラスクルマ専用　ジャスミンマリア   
<--- this is 26 characters(72bytes)

森永ＩＱサポート　もも＆りんご　１２５ｍｌ×３　       
<--- this is 28 characters(200bytes)

With this, I am already out of ideas on how to deal with this. Can anyone suggest a solution to this?

I tried using

mb_convert_encoding($vandataBuf, "UTF-8","Shift-JIS");

and it didn't do the job.

I also tried adjusting how many characters this code reads and changed it to 26 characters. That works for the first line, since it has 26 characters, but it reads the second line wrong, which has 28 characters:

$prod_nm        =mb_substr($convertBufstring,$i+19,26,"UTF-8");

I also tried converting all full-width characters to half-width, but then the number of characters also changes, so it's not consistent.

$convertBufstring = mb_convert_kana($convertBufstringBEFORE, "KansC");

$prod_nm        =mb_strcut($convertBufstring,$i+19,52,"UTF-8");

I am already out of ideas. Can anyone suggest something? Maybe I missed something.

The `mb_` functions are multi-byte aware, but it is possible your data is just on standard byte boundaries, so maybe try `substr` instead of `mb_substr`, and then convert those bytes. — Chris Haas, Feb 26 '23 at 14:38
I tried using `substr` and it showed as blank. So `mb_substr` should be correct, I think. — K a r L, Feb 26 '23 at 23:49
To properly debug this issue we need [the actual bytes](https://stackoverflow.com/a/4225813/4299358), not pre-interpreted text. Output it byte wise, then we can judge at which step wrong assumptions are made. Try to break it down to an example with 256 bytes and edit the output into your question. Avoid pictures. — AmigoJack, Feb 27 '23 at 11:21
@AmigoJack I ran the code from your link and I displayed a part of the String I am trying to read and separate into chunks — K a r L, Feb 27 '23 at 13:50

score 0 · Answer 1 · answered Feb 27 '23 at 16:46

Let me put that unfortunately formatted output into a hex dump that is not interpreted by HTML (so multiple spaces are not lost) and is cut into aligned rows of 16 bytes. I've inserted slashes to mark where the parts of each line begin and end:

position 0x0 (256 bytes follow):
33 37 30/31 38/30/34 39  30 31 30 37 30 31 32 35  [3701804901070125]
36 31 36/e3 82 b7 e3 83  a3 e3 83 ab e3 83 80 e3  [616.............] prod_nm = 26 chars in 78 bytes
83 b3 e3 80 80 e3 82 b9  e3 83 86 e3 82 ad e3 83  [................]
97 e3 83 a9 e3 82 b9 e3  82 af e3 83 ab e3 83 9e  [................]
e5 b0 82 e7 94 a8 e3 80  80 e3 82 b8 e3 83 a3 e3  [................]
82 b9 e3 83 9f e3 83 b3  e3 83 9e e3 83 aa e3 82  [................]
a2/ef bd bc ef bd ac ef  be 99 ef be 80 ef be 9e  [................] prod_kn = 20 chars
ef be 9d 20 ef bd bd ef  be 83 ef bd b7 ef be 8c  [... ............]
ef be 9f ef be 97 ef bd  bd ef bd b8 ef be 99 ef  [................]
be 8f 20 ef bd bc ef be  9e/32 31 33 31 39 37/30  [.. ......2131970]
30 38/30/20/20 20 20 20  20 20 20 20 20 20 20 20  [080             ]
20 20 20 20/31 36 30 33  33 30 30 31 33 38|33 37  [    160330013837]
30/31 38/30/34 39 30 32  37 32 30 31 32 33 35 34  [0180490272012354]
39/e6 a3 ae e6 b0 b8 ef  bc a9 ef bc b1 e3 82 b5  [9...............] prod_nm = 24 chars in 71 bytes plus 4 spaces
e3 83 9d e3 83 bc e3 83  88 e3 80 80 e3 82 82 e3  [................]
82 82 ef bc 86 e3 82 8a  e3 82 93 e3 81 94 e3 80  [................]

position 0x100 (256 bytes follow):
80 ef bc 91 ef bc 92 ef  bc 95 ef bd 8d ef bd 8c  [................]
c3 97 ef bc 93 e3 80 80  20 20 20 20/ef be 93 ef  [........    ....] prod_kn = 20 chars
be 98 ef be 85 ef bd b6  ef be 9e 49 51 ef bd bb  [...........IQ...]
ef be 8e ef be 9f ef bd  b0 ef be 84 20 ef be 93  [............ ...]
ef be 93 26 ef be 98 ef  be 9d ef bd ba ef be 9e/ [...&............]
31 39 30 31 30 33/30 30  38/30/20/20 20 20 20 20  [1901030080      ]
20 20 20 20 20 20 20 20  20 20 20/31 36 30 33 33  [           16033]
30 30 31 33 38|33 37 30/ 31 38/30/34 39 38 37 30  [0013837018049870]
37 32 30 34 32 35 35 37/ e5 99 9b e3 82 80 e3 83  [72042557........] prod_nm = 18 chars in 54 bytes plus 16 spaces
96 e3 83 ac e3 82 b9 e3  82 b1 e3 82 a2 e3 80 80  [................]
e3 83 9e e3 82 b9 e3 82  ab e3 83 83 e3 83 88 e3  [................]
80 80 ef bc 92 ef bc 95  e7 b2 92 e3 80 80 20 20  [..............  ]
20 20 20 20 20 20 20 20  20 20 20 20 20 20/ef be  [              ..] prod_kn = 20 chars
8c ef be 9e ef be 9a ef  bd bd ef bd b9 ef bd b1  [................]
ef bd b6 ef be 91 20 32  35 54 20 ef be 8f ef bd  [...... 25T .....]
bd ef bd b6 ef bd af ef  be 84 20 20/32 31 32 31  [..........  2121]

position 0x200 (33 bytes follow):
30 37/30 30 38/30/20/20  20 20 20 20 20 20 20 20  [070080          ]
20 20 20 20 20 20 20/31  36 30 33 33 30 30 31 33  [       160330013]
38                                                [8]

What does it tell us?

- 128 characters need more than 128 bytes. Since all your data lines seem to start with 37018049 and end with 1603300138, we can see that
  - the first line needs 11*16 + 14 = 190 bytes (effectively 102 characters),
  - the second needs 2 + 11*16 + 5 = 183 bytes (effectively 104 characters) and
  - the third needs 11 + 10*16 + 1 = 172 bytes (effectively 110 characters).
- e3 82 b7, e3 83 a3 and e3 83 ab are clearly UTF-8, needing 3 bytes per Katakana (シ, ャ and ル).
- ef bd bc and ef be 9e are also clearly UTF-8 (ｼ and ﾞ), but this time they are the halfwidth characters. ef bc 92 and ef bc 95 (the fullwidth digits ２ and ５) are UTF-8 as well. So the encoding is not the issue.
- Multiple variants of spaces are used: e3 80 80 (ideographic space, CJK) versus 20 (regular ASCII space). That's interesting.
- The product name (prod_nm) needs
  - in the first line 26 chars in 78 bytes: that's 3 bytes per character, just as expected,
  - in the second line 24 chars in 71 bytes (23 characters with 3 bytes and 1 character with 2 bytes) plus a padding of 4 (ASCII) spaces,
  - in the third line 18 chars in 54 bytes (3 bytes per character) plus a padding of 16 spaces.
- Pattern-wise, it looks as if, whenever fewer than 26 characters are used, the rest is padded with twice the number of (ASCII) spaces needed:
  - the first line needs none,
  - the second actually needs only 2 spaces (instead of 4),
  - the third actually needs only 8 spaces (instead of 16).
- That's clearly not your fault; whoever produces the data is padding it wrongly. Also note that the ideographic CJK spaces are not the issue, only the ASCII space 20 is.
- Your graphic/screenshot is misleading: just because we have non-halfwidth Katakana/Kanji doesn't mean each grapheme consists of 2 characters. They are still 1 character each and don't measure "2" in any unit. That's also why you wrongly attempt to read 52 characters instead of only 26.
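The character and byte counts above can be checked directly in PHP. This is only a minimal sketch: it assumes the script file is saved as UTF-8 and that the mbstring extension is loaded, and `measure_name` is a hypothetical helper name that does not appear in the question.

```php
<?php
// Hypothetical helper (not from the question): splits one prod_nm field into
// its text and its trailing ASCII padding, then measures both.
function measure_name(string $field): array
{
    // rtrim() with the mask ' ' removes only byte 0x20, so the ideographic
    // space (e3 80 80) at the end of the text is left untouched.
    $text = rtrim($field, ' ');

    return [
        'chars'   => mb_strlen($text, 'UTF-8'),      // characters
        'bytes'   => strlen($text),                  // raw UTF-8 bytes
        'padding' => strlen($field) - strlen($text), // ASCII spaces, 1 byte each
    ];
}

// Product names from the sample data; the padding is rebuilt with str_repeat().
$fields = [
    'シャルダン　ステキプラスクルマ専用　ジャスミンマリア',
    '森永ＩＱサポート　もも＆りんご　１２５ｍｌ×３　' . str_repeat(' ', 4),
    '噛むブレスケア　マスカット　２５粒　' . str_repeat(' ', 16),
];

foreach ($fields as $field) {
    $m = measure_name($field);
    printf(
        "%d chars, %d bytes, %d padding spaces, 2 * (26 - chars) = %d\n",
        $m['chars'], $m['bytes'], $m['padding'], 2 * (26 - $m['chars'])
    );
}
```

For the three sample names, the padding found equals 2 * (26 - chars) each time, which is the doubling pattern described above.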

You could adapt to it: read 26 characters, then count backwards how many (ASCII) spaces (20) are at the end, and expect the same number of additional spaces before reading on, for example:

$vandataBuf= fgets($fp);

// Converting from Shift-JIS to UTF-8 is not strictly needed, because the
// following functions also accept "Shift-JIS" as the encoding parameter
// instead of "UTF-8". However, I'll leave it the way you had it:
$convertBufstring= mb_convert_encoding( $vandataBuf, "UTF-8", "Shift-JIS" );

// Read 26 characters "product name"
$prod_nm       = mb_substr( $convertBufstring, $i+ 19, 26, "UTF-8" );  // "森永ＩＱサポート　もも＆りんご　１２５ｍｌ×３　  "

// Count spaces from the end
$spaces= 0;
$position= 26;
while( $position> 0  // Never beyond the beginning
    && mb_substr( $prod_nm, $position- 1, 1, "UTF-8" )== " "  // As long as it is an ASCII space
     ) {
  $spaces++;
  $position--;
}

// Now increase your offset "i" by the number of spaces found, because
// the padding always contains twice as many spaces as needed.
$i+= $spaces;  // Should be 0 for the first line as per example, 2 for second, 8 for third

// Read on...
$prod_kn       = mb_substr( $convertBufstring, $i+ 45, 20, "UTF-8" );  // "ﾓﾘﾅｶﾞIQｻﾎﾟｰﾄ ﾓﾓ&ﾘﾝｺﾞ"
$jicfs_class_cd= mb_substr( $convertBufstring, $i+ 65,  6, "UTF-8" );  // "190103"
$prod_tax      = mb_substr( $convertBufstring, $i+ 71,  3, "UTF-8" );  // "008"
$regi_duty_kbn = mb_substr( $convertBufstring, $i+ 74,  1, "UTF-8" );  // "0"
$auto_order_kbn= mb_substr( $convertBufstring, $i+ 75,  1, "UTF-8" );  // " "
$spacex16      = mb_substr( $convertBufstring, $i+ 76, 16, "UTF-8" );  // "                "
$ending        = mb_substr( $convertBufstring, $i+ 92, 10, "UTF-8" );  // "1603300138"
// ...effectively always expecting 102 characters, not 128 (26 less).

This is untested, but I'm very sure this kludge works.
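Putting the same idea into a loop over all records could look roughly like the sketch below. The function name `parse_records` and the array keys are hypothetical, and the offsets are the ones used above. Two assumptions are baked in: the buffer is already UTF-8, and the loop counts characters with `mb_strlen` rather than bytes with `strlen`, because every offset passed to `mb_substr` is a character offset.

```php
<?php
// Hypothetical helper: walks a UTF-8 buffer record by record using the
// "skip the duplicated padding" kludge. Not part of the original code.
function parse_records(string $buf): array
{
    $records = [];
    $total   = mb_strlen($buf, 'UTF-8');  // characters, not bytes
    $i       = 0;

    // Each record has at least 102 characters (zero extra padding).
    while ($i + 102 <= $total) {
        $rec = [
            'tenpo_cd'  => mb_substr($buf, $i,      3, 'UTF-8'),
            'chiku_cd'  => mb_substr($buf, $i + 3,  2, 'UTF-8'),
            'shori_kbn' => mb_substr($buf, $i + 5,  1, 'UTF-8'),
            'jan_cd'    => mb_substr($buf, $i + 6, 13, 'UTF-8'),
        ];

        // Read 26 characters, then count the ASCII spaces at their end.
        $name   = mb_substr($buf, $i + 19, 26, 'UTF-8');
        $text   = rtrim($name, ' ');               // strips byte 0x20 only
        $spaces = strlen($name) - strlen($text);   // one byte per ASCII space
        $rec['prod_nm'] = $text;

        // The same number of spaces follows once more: shift the offset.
        $i += $spaces;

        $rec['prod_kn']        = mb_substr($buf, $i + 45, 20, 'UTF-8');
        $rec['jicfs_class_cd'] = mb_substr($buf, $i + 65,  6, 'UTF-8');
        $rec['prod_tax']       = mb_substr($buf, $i + 71,  3, 'UTF-8');
        $rec['regi_duty_kbn']  = mb_substr($buf, $i + 74,  1, 'UTF-8');
        $rec['auto_order_kbn'] = mb_substr($buf, $i + 75,  1, 'UTF-8');
        $rec['ending']         = mb_substr($buf, $i + 92, 10, 'UTF-8');

        $records[] = $rec;
        $i += 102;  // $i was already shifted, so this lands on the next record
    }

    return $records;
}
```

Because `$i` is shifted by the number of spaces found before the fixed step of 102 is added, the next iteration starts at the first character of the next record, even though the records have different lengths (102, 104 and 110 characters in the sample).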

` && mb_substr( $prod_nm, $position- 1, 1 )== " " ` changed this to prod_nm (product name) — K a r L, Feb 28 '23 at 01:06
Thank you for the detailed analysis and the code. I tried implementing this in my code and output the result as HTML in the browser. It works on the 1st line, but on the 2nd line it breaks down. I forgot to mention that this is inside a for-loop. I am using strlen: `$totalCountstr = strlen($convertBufstring); ` `$width = 102;` `for ($i = 0; $i <= $totalCountstr; $i += $width) {` — K a r L, Feb 28 '23 at 01:11
Then edit your question to include **all** necessary code if you're unable to adapt your loop with `i` accordingly. I thought it was obvious enough on how to treat `i` with all the comments I made in my code. — AmigoJack, Feb 28 '23 at 07:28
