I have the following vector and want to replace the subscript digits (e.g. ₆, ₂) with 'normal' digits.

vec = c("C₆H₄ClNO₂", "C₆H₆N₂O₂", "C₆H₅NO₃", "C₉H₁₀O₂", "C₈H₈O₃")

I could look up every subscript character and replace each one individually:

gsub('₆', '6', vec)

But isn't there a regex pattern that matches all of them?

There's a similar question for JavaScript, but I couldn't translate it into R.

smci
andschar
  • 5
    `chartr("₀₁₂₃₄₅₆₇₈₉", "0123456789", vec)` – Wiktor Stribiżew Sep 19 '19 at 08:54
  • Possible duplicate of [Replace multiple strings in one gsub() or chartr() statement in R?](https://stackoverflow.com/questions/33949945/replace-multiple-strings-in-one-gsub-or-chartr-statement-in-r) – Wiktor Stribiżew Sep 19 '19 at 08:54
  • I don't know anything about R, but could you do something like in C, where you "add" the difference in the ASCII table to a letter? I.e., assume the subscripts are next to each other in the character table, subtract `1` from `₁` to get the delta, and apply it to each digit. Otherwise I'd just make a map (₁ -> 1, etc.) – LogicalKip Sep 19 '19 at 08:55
  • 2
    @WiktorStribiżew it's not really a duplicate imho, since I'm asking for a pattern in regex for sub/superscripts. But yes, this would be one possibility. – andschar Sep 19 '19 at 08:57
  • 3
    You need no regex here. `chartr` is from base R, use it here. – Wiktor Stribiżew Sep 19 '19 at 08:59
  • 1
    Possible duplicate of [Using multiple gsubs in one r function](https://stackoverflow.com/questions/35318530/using-multiple-gsubs-in-one-r-function) – akrun Sep 19 '19 at 17:51
  • 1
    duplicate of https://stackoverflow.com/q/6954017/680068 – zx8754 Sep 25 '19 at 07:04
  • 4
    This question is in my opinion significantly different from the proposed duplicate, which is why I voted to undelete and reopen. Besides, `chartr` is way too underappreciated and deserves more of our love. – Roman Luštrik Sep 25 '19 at 07:04

2 Answers

Use chartr. Its help page describes it as:

Translate characters in character vectors

Solution:

chartr("₀₁₂₃₄₅₆₇₈₉", "0123456789", vec)

BONUS

To normalize superscript digits use

chartr("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")
## => [1] "0123456789"
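
Both mappings can be folded into a single call. The following is only a sketch: the helper name normalize_digits is hypothetical (not part of base R), and the digit strings are built from code points so the script does not depend on the source file's encoding. Note that the superscripts ¹, ² and ³ sit at U+00B9, U+00B2 and U+00B3, apart from the rest of the superscript digits (U+2070, U+2074 to U+2079), so the superscripts do not form one contiguous range the way the subscripts do.

```r
# Hypothetical helper: map Unicode sub- and superscript digits to ASCII digits.
subs   <- intToUtf8(0x2080:0x2089)                      # subscripts 0-9
supers <- intToUtf8(c(0x2070, 0x00B9, 0x00B2, 0x00B3,   # superscripts 0-3
                      0x2074:0x2079))                   # superscripts 4-9

normalize_digits <- function(x) {
  # chartr maps the i-th character of the first string to the
  # i-th character of the second, for every element of x.
  chartr(paste0(subs, supers), "01234567890123456789", x)
}
```

Calling normalize_digits(vec) on the vector from the question should give the same result as the subscript-only chartr call, while also converting any superscript digits.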
Wiktor Stribiżew

We can use str_replace_all from stringr to match each subscript digit, convert it to its integer code point, subtract 8272 (the difference between the code points of ₆ and 6, which is the same for every other subscript/digit pair), and convert the result back to a character.

stringr::str_replace_all(vec, "\\p{No}", function(m) intToUtf8(utf8ToInt(m) - 8272))
#[1] "C6H4ClNO2" "C6H6N2O2"  "C6H5NO3"   "C9H10O2"   "C8H8O3" 

As pointed out by @Wiktor Stribiżew, "\\p{No}" matches more than just subscript digits. To match only the subscripts 0-9, we can use a character class instead (thanks to @thothal):

stringr::str_replace_all(vec, "[\U2080-\U2089]", function(m) intToUtf8(utf8ToInt(m) - 8272))
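
For readers without stringr, the same code-point arithmetic can be sketched in base R. This is an illustration under stated assumptions, not the answer's own method: the function name shift_subscripts is hypothetical, and only code points in U+2080 to U+2089 are shifted, so other characters (including superscripts) are left untouched.

```r
# Sketch: shift only code points U+2080..U+2089 down by 8272 (0x2080 - 0x30).
shift_subscripts <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)                        # code points of one string
    is_sub <- cp >= 0x2080L & cp <= 0x2089L   # TRUE for subscript digits only
    cp[is_sub] <- cp[is_sub] - 8272L          # e.g. 8326 (U+2086) becomes 54 ("6")
    intToUtf8(cp)                             # back to a single string
  }, character(1), USE.NAMES = FALSE)
}
```

Restricting the shift to that range is what keeps the subtraction safe: characters outside it would otherwise end up at meaningless or negative code points.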
Ronak Shah