64

I need to modify strings similar to "¼ cups of sugar" to "cups of sugar", meaning replacing all fraction symbols with "".

I have referred to this post and managed to remove ¼ using this line:

itemName = itemName.replaceAll("\u00BC", "");

but how do I replace every possible fraction symbol out there?

Michelle
  • 601
  • 5
  • 9
  • What about removing all non-alphanumeric characters except spaces, using: itemName.replaceAll("[^A-Za-z0-9 ]", ""); – Fady Saad Apr 12 '17 at 02:53
  • 2
    Java is not Android – Ungeheuer Apr 12 '17 at 02:57
  • 1
    @Ungeheuer got it. tag removed. – Michelle Apr 12 '17 at 03:03
  • 19
    Perhaps I spend too long on cooking.se but I wonder *why* you're doing this (as opposed to replacing "¼ cups of sugar" with " 1/4 cups of sugar"). – Chris H Apr 12 '17 at 10:35
  • 6
    May I ask why you would want to completely remove things that will change the semantic meaning of the string? I'm curious. – Matti Virkkunen Apr 12 '17 at 13:45
  • 2
    @ChrisH and Matti - I'm building an app for recipes and shopping lists - and I'm using an API which returns a JSON with ingredients combined with their quantity needed. I am still keeping the original string, but giving the user an option to see items grouped by their 'clean names' (so they only see one item) instead of seeing 5 rows of different quantities of garlic. Did I explain that right? Sorry, I'm a total novice. – Michelle Apr 12 '17 at 23:12
  • @Michelle that sounds reasonable, if tricky to get just right (I could imagine a recipe calling for "1 cup of sugar" as well as "sugar (for dusting)"), so the grouping could be a challenge. Good luck – Chris H Apr 13 '17 at 05:54
  • 2
    If it's for a cooking app I'd suggest just hard coding the replacements for a limited number of fractions, maybe 1/2 to 1/10. I've never seen a recipe which called for 1/1076... – Ian Newson Apr 19 '17 at 18:16

2 Answers

97

Fraction symbols like ¼ and ½ belong to Unicode Category Number, Other [No]. If you are ok with eliminating all 676 characters in that group, you can use the following regular expression:

itemName = itemName.replaceAll("\\p{No}+", "");

If not, you can always list them explicitly:

// As characters (requires UTF-8 source file encoding)
itemName = itemName.replaceAll("[¼½¾⅐⅑⅒⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞↉]+", "");

// As ranges using unicode escapes
itemName = itemName.replaceAll("[\u00BC-\u00BE\u2150-\u215E\u2189]+", "");
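For reference, a minimal self-contained sketch of both approaches above. The class name `FractionStripper`, the method names, and the trailing `trim()` (which drops the space left behind by the removed symbol) are assumptions added for illustration, not part of the original snippets.

```java
import java.util.regex.Pattern;

public class FractionStripper {
    // Broad option: every character in Unicode category "Number, Other",
    // which includes superscript digits and circled numbers, not only fractions.
    private static final Pattern OTHER_NUMBERS = Pattern.compile("\\p{No}+");

    // Narrow option: only the precomposed fraction symbols
    // (U+00BC to U+00BE, U+2150 to U+215E, and U+2189).
    private static final Pattern FRACTIONS =
            Pattern.compile("[\u00BC-\u00BE\u2150-\u215E\u2189]+");

    static String stripOtherNumbers(String s) {
        return OTHER_NUMBERS.matcher(s).replaceAll("").trim();
    }

    static String stripFractions(String s) {
        return FRACTIONS.matcher(s).replaceAll("").trim();
    }

    public static void main(String[] args) {
        // U+00BC is the "one quarter" symbol.
        String item = "\u00BC cups of sugar";
        System.out.println(stripFractions(item)); // prints "cups of sugar"

        // U+00B2 (superscript two) is in category No but is not a fraction,
        // so only the broad option removes it.
        String mixed = "x\u00B2 + \u2153 cup";
        System.out.println(stripOtherNumbers(mixed));
        System.out.println(stripFractions(mixed));
    }
}
```

Precompiling the patterns is only a convenience when the same cleanup runs over many item names; `String.replaceAll` with the same regex strings gives the same result.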
Andreas
  • 154,647
  • 11
  • 152
  • 247
  • 2
    Note that fonts may render _any_ sequence like 23/12 as fractions, thus enabling any fraction to be shown like that, not just the pre-composed ones. If that happens you may need to remove a lot more than just a list of characters. – Joey Apr 12 '17 at 06:16
  • Why the + in the regexes? Can't you simply leave it out, or does it do anything for efficiency? – HopefullyHelpful Apr 12 '17 at 10:01
  • 1
    @HopefullyHelpful In this case the `+` operator causes the character set (`[...]`) to repeat multiple times. See this answer for more details: http://stackoverflow.com/a/3850256/3088508 – Ethan Apr 12 '17 at 11:39
  • 6
    @HopefullyHelpful yes they aren't necessary, and yes they should improve efficiency. One should probably not draw conclusions from it, but if you add a `+` at the end of the expression in [this regex101 sample](https://regex101.com/r/mwDnAG/1) execution time will go down from 1 to 0ms and the number of steps will fall from 32 to 14. On an input without any repeats [it only adds one step](https://regex101.com/r/bm80sa/2) – Aaron Apr 12 '17 at 12:24
  • 1
    @Aaron I would refute that conclusion with https://regex101.com/r/9Md35x/1, the change seems marginal and I would attribute it to the javascript implementation potentially and maybe flow prediction – HopefullyHelpful Apr 12 '17 at 12:30
  • 1
    @HopefullyHelpful heh? Testing it on my side, it seems to behave marginally better with `+`, going down from 148305 steps to 139377 and from ~375ms to ~350ms. Thanks for taking the time to make a good data set in any case ! You're right that it probably depends on regex engines specifics – Aaron Apr 12 '17 at 12:31
  • 1
    @I tested it with a larger sample and it's a 3% increase, but I would expect it to be dependent on the language and the code. Javascript is a slow scripting language, so the prediction that another one might come as well could boost it to a larger % than for C or Java. Would be interesting to test though. – HopefullyHelpful Apr 12 '17 at 12:36
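To make the discussion of `+` above concrete, here is a small sketch (the class name `PlusQuantifierDemo` and the helper `countMatches` are hypothetical, added for illustration). It shows that the replacement result is identical with or without `+`; what changes is how many separate matches the engine performs on a run of adjacent symbols. It says nothing about actual timings, which depend on the regex engine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlusQuantifierDemo {
    // Counts how many separate matches the regex finds in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Three adjacent fraction symbols (U+00BC, U+00BD, U+00BE) followed by text.
        String input = "\u00BC\u00BD\u00BE cups";

        String single = "[\u00BC-\u00BE]";   // one symbol per match
        String run = "[\u00BC-\u00BE]+";     // a whole run per match

        System.out.println(countMatches(single, input)); // prints 3
        System.out.println(countMatches(run, input));    // prints 1

        // Both variants produce the same cleaned string.
        System.out.println(input.replaceAll(single, "").equals(input.replaceAll(run, ""))); // prints true
    }
}
```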
2

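You can use the regex below to replace the fraction symbols ¼, ½ and ¾ (U+00BC to U+00BE) with an empty string. Other fraction symbols, such as ⅓, are outside this range and are not matched.

str = str.replaceAll("[\\xbc-\\xbe]", "");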
Sumit Gulati
  • 665
  • 4
  • 14