How can I convert the full-width string Ａb９８７６５４３２１０ into the half-width Ab9876543210? Is there a solution using regular expressions?

test <- dput("Ａb９８７６５４３２１０")

TylerH
swchen
  • Not sure what exactly you meant. Both strings look the same except for the format – akrun Aug 31 '17 at 03:37
  • I've seen this question discussed before for Java [here](https://stackoverflow.com/questions/29508932/how-check-if-string-has-full-width-character-in-java), but I've never come across it in R. Can you provide your full-width string using `dput`? If I copy & paste it into my console, it's the same as normal text... – Z.Lin Aug 31 '17 at 03:46
  • @Z.Lin I changed my code, can you see the full-width string? – swchen Aug 31 '17 at 03:54
  • I was able to semi-reproduce your problem by copying the full-width string to a text file, saving it with UTF-8 encoding, then loading it into R with the encoding specified as such. Copying it directly into R doesn't work, nor does typing full-width text (I switched to pinyin mode & toggled on full-width). I'll try to see what I can come up with. – Z.Lin Aug 31 '17 at 04:15
  • For anyone else interested to try this, the dput result of the full width string on my machine is "Ab9876543210" – Z.Lin Aug 31 '17 at 04:15

2 Answers

Disclaimer: The following works on my machine, but since I can't replicate your full-width string based purely on the example provided, this is a best guess based on my version of the problem (pasting the string into a text file, saving it with UTF-8 encoding, & loading it in with the encoding specified as UTF-8).

Step 1. Reading in the text (I added a half width version for comparison):

> test <- readLines("fullwidth.txt", encoding = "UTF-8")
> test
[1] "Ａb９８７６５４３２１０" "Ab9876543210"

Step 2. Verifying that the full & half width versions are not equal:

# using all.equal()
test1 <- test[1]
test2 <- test[2]
> all.equal(test1, test2)
[1] "1 string mismatch"

# compare raw bytes
> charToRaw(test1)
 [1] ef bb bf ef bc a1 62 ef bc 99 ef bc 98 ef bc 97 ef bc 96 ef bc 95 ef
[24] bc 94 ef bc 93 ef bc 92 ef bc 91 ef bc 90
> charToRaw(test2)
 [1] 41 62 39 38 37 36 35 34 33 32 31 30

For anyone interested: if you paste the raw bytes into a UTF-8 decoder as hexadecimal input, you'll see that, except for the letter b (the single byte 62 in the 7th position), every character is encoded as a 3-byte sequence. In addition, the first 3-byte sequence (ef bb bf) decodes to U+FEFF, "ZERO WIDTH NO-BREAK SPACE" (also used as the byte order mark), so it's not visible when you print the string to the console.
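If you'd rather not use an external decoder, the same check can be sketched in base R. This is only an illustration: it rebuilds the string from the bytes printed by charToRaw(test1) above (the names `bytes`, `s` and `cp` are made up for this example) and lists one code point per character.

```r
# Bytes copied from the charToRaw(test1) output above
bytes <- as.raw(c(
  0xef, 0xbb, 0xbf,                                  # U+FEFF, byte order mark
  0xef, 0xbc, 0xa1,                                  # U+FF21, full-width A
  0x62,                                              # U+0062, ordinary b
  0xef, 0xbc, 0x99, 0xef, 0xbc, 0x98, 0xef, 0xbc, 0x97,
  0xef, 0xbc, 0x96, 0xef, 0xbc, 0x95, 0xef, 0xbc, 0x94,
  0xef, 0xbc, 0x93, 0xef, 0xbc, 0x92, 0xef, 0xbc, 0x91,
  0xef, 0xbc, 0x90                                   # full-width digits 9 down to 0
))

s <- rawToChar(bytes)
Encoding(s) <- "UTF-8"   # mark the bytes as UTF-8 so R decodes them as such

cp <- utf8ToInt(s)       # one integer code point per character
sprintf("U+%04X", cp)    # starts with "U+FEFF" "U+FF21" "U+0062"
```

The 37 bytes collapse to 13 code points: the invisible mark, the full-width A, the plain b, and ten full-width digits.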

Step 3. Converting from full width to half width using the Nippon package:

library(Nippon)
test1.converted <- zen2han(test1)

> test1.converted
[1] "Ab9876543210"

# If you want to compare against the original test2 string, remove the zero 
# width character in front
> all.equal(substring(test1.converted, 2), test2)
[1] TRUE
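
As an alternative to substring(), which assumes the mark is always present, here is a small base R sketch that strips a leading U+FEFF only when it is there. The `converted` value below is a stand-in I typed by hand for the zen2han() result, not output copied from the package.

```r
# Stand-in for test1.converted: half-width text that still begins with U+FEFF
converted <- "\ufeffAb9876543210"

# Remove the mark only if it sits at the very start of the string
cleaned <- sub("^\ufeff", "", converted)

identical(cleaned, "Ab9876543210")
nchar(converted)   # counts the invisible mark as a character
nchar(cleaned)
```

If you control how the file is read, the documentation for file() also lists a "UTF-8-BOM" encoding that drops the mark at read time, which may avoid this step altogether.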
Z.Lin

Here is a base R solution

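Full-width ASCII variants occupy U+FF01 to U+FF5E, and subtracting 65248 (0xFEE0) maps each one onto its half-width counterpart in U+0021 to U+007E. The rest of the block up to U+FFEF holds other forms (for example half-width katakana) that this offset does not apply to.

x <- "Ａb９８７６５４３２１０"
iconv(x, to = "UTF-8") |>
  utf8ToInt() |>
  # shift only the full-width ASCII variants; leave every other code point alone
  (\(.) ifelse(. >= 0xFF01 & . <= 0xFF5E, . - 65248, .))() |>
  intToUtf8()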

[1] "Ab9876543210"
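
If you need this for a whole character vector, the same offset idea can be wrapped in a function. This is a rough sketch: `to_halfwidth` is a name I made up, not something from base R or a package. It only shifts U+FF01 to U+FF5E (the full-width forms that have an ASCII counterpart at a fixed offset), and mapping the ideographic space U+3000 to a normal space is an extra assumption on my part.

```r
# Hypothetical helper: convert full-width ASCII variants to half-width
to_halfwidth <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(enc2utf8(s))           # code points of one string
    fw <- cp >= 0xFF01L & cp <= 0xFF5EL    # full-width ASCII variants only
    cp[fw] <- cp[fw] - 0xFEE0L             # 0xFEE0 == 65248
    cp[cp == 0x3000L] <- 0x20L             # assumption: ideographic space -> space
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

# \uff21 is full-width A, \uff19 and \uff18 are full-width 9 and 8
to_halfwidth(c("\uff21b\uff19\uff18", "\uff01\u3000\uff5e", NA))
```

Characters outside that range, including a leading U+FEFF, pass through unchanged.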
TylerH
Donald Seinen