Removing all fraction symbols like “¼” and “½” from a string

Question

I need to modify strings similar to "¼ cups of sugar" to "cups of sugar", meaning replacing all fraction symbols with "".

I have referred to this post and managed to remove ¼ using this line:

itemName = itemName.replaceAll("\u00BC", "");

but how do I replace every possible fraction symbol out there?

What about removing all non-alphanumeric characters except spaces, using itemName.replaceAll("[^A-Za-z0-9 ]", "");? — Fady Saad, Apr 12 '17 at 02:53
Perhaps I spend too long on cooking.se but I wonder *why* you're doing this (as opposed to replacing "¼ cups of sugar" with " 1/4 cups of sugar"). — Chris H, Apr 12 '17 at 10:35
May I ask why you would want to completely remove things that will change the semantic meaning of the string? I'm curious. — Matti Virkkunen, Apr 12 '17 at 13:45
@ChrisH and Matti - I'm building an app for recipes and shopping lists - and I'm using an API which returns a JSON with ingredients combined with their quantity needed. I am still keeping the original string, but giving the user an option to see items grouped by their 'clean names' (so they only see one item) instead of seeing 5 rows of different quantities of garlic. Did I explain that right? Sorry, I'm a total novice. — Michelle, Apr 12 '17 at 23:12
@Michelle that sounds reasonable, if tricky to get just right (I could imagine a recipe calling for "1 cup of sugar" as well as "sugar (for dusting)", so the grouping could be a challenge). Good luck — Chris H, Apr 13 '17 at 05:54
If it's for a cooking app I'd suggest just hard coding the replacements for a limited number of fractions, maybe 1/2 to 1/10. I've never seen a recipe which called for 1/1076... — Ian Newson, Apr 19 '17 at 18:16
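
The two suggestions in the comments above (keeping the quantity as "1/4" and hard-coding a limited set of fractions) could be combined roughly as follows. This is only a sketch: the class name `FractionText`, the method `toAscii`, and the particular fractions in the table are assumptions for illustration, not something from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FractionText {
    // Hypothetical lookup table: a small, hand-picked set of fraction symbols
    // and their plain-text spellings. Extend it as needed.
    private static final Map<String, String> ASCII_FORMS = new LinkedHashMap<>();
    static {
        ASCII_FORMS.put("\u00BC", "1/4");
        ASCII_FORMS.put("\u00BD", "1/2");
        ASCII_FORMS.put("\u00BE", "3/4");
        ASCII_FORMS.put("\u2153", "1/3");
        ASCII_FORMS.put("\u2154", "2/3");
        ASCII_FORMS.put("\u215B", "1/8");
    }

    // Replaces each listed fraction symbol with its plain-text spelling.
    // Symbols that are not in the table are left untouched.
    static String toAscii(String s) {
        String result = s;
        for (Map.Entry<String, String> e : ASCII_FORMS.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u00BC cups of sugar")); // 1/4 cups of sugar
    }
}
```

One caveat with this approach: a mixed number written without a space, such as "1½", would come out as "11/2", so real input may need a separator inserted before the replacement.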

Andreas · Accepted Answer · 2017-04-12T02:56:01.087

97

Fraction symbols like ¼ and ½ belong to Unicode Category Number, Other [No]. If you are ok with eliminating all 676 characters in that group, you can use the following regular expression:

itemName = itemName.replaceAll("\\p{No}+", "");

If not, you can always list them explicitly:

// As characters (requires UTF-8 source file encoding)
itemName = itemName.replaceAll("[¼½¾⅐⅑⅒⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞↉]+", "");

// As ranges using unicode escapes
itemName = itemName.replaceAll("[\u00BC-\u00BE\u2150-\u215E\u2189]+", "");
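
As a rough, self-contained sketch of the difference between the two approaches, the class below applies both patterns to the same input. The class and method names are made up for illustration, and the `trim()` call is an addition (not part of the answer) so that "¼ cups of sugar" becomes "cups of sugar" without the leading space.

```java
import java.util.regex.Pattern;

final class FractionStripper {
    // Precomposed fractions only: U+00BC..U+00BE, U+2150..U+215E, and U+2189.
    private static final Pattern FRACTIONS =
            Pattern.compile("[\u00BC-\u00BE\u2150-\u215E\u2189]+");

    // Broader alternative: every character in Unicode category No,
    // which also includes superscript digits, circled numbers, and so on.
    private static final Pattern OTHER_NUMBERS = Pattern.compile("\\p{No}+");

    // Removes only the listed fraction symbols, then trims surrounding whitespace.
    static String stripFractions(String s) {
        return FRACTIONS.matcher(s).replaceAll("").trim();
    }

    // Removes every "Number, Other" character, then trims surrounding whitespace.
    static String stripOtherNumbers(String s) {
        return OTHER_NUMBERS.matcher(s).replaceAll("").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripFractions("\u00BC cups of sugar"));  // cups of sugar
        System.out.println(stripFractions("x\u00B2 + \u00BD"));      // keeps the superscript two
        System.out.println(stripOtherNumbers("x\u00B2 + \u00BD"));   // removes it as well
    }
}
```

Compiling the patterns once and reusing them avoids recompiling the regex on every `replaceAll` call, which may matter when cleaning many ingredient strings.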

edited Apr 12 '17 at 02:56

answered Apr 12 '17 at 02:49

Note that fonts may render _any_ sequence like 23/12 as fractions, thus enabling any fraction to be shown like that, not just the pre-composed ones. If that happens you may need to remove a lot more than just a list of characters. – Joey Apr 12 '17 at 06:16
Why the + in the regexes? Can't you simply leave it out, or does it do anything for efficiency? – HopefullyHelpful Apr 12 '17 at 10:01
@HopefullyHelpful In this case the `+` operator causes the character set (`[...]`) to repeat multiple times. See this answer for more details: http://stackoverflow.com/a/3850256/3088508 – Ethan Apr 12 '17 at 11:39
@HopefullyHelpful yes they aren't necessary, and yes they should improve efficiency. One should probably not draw conclusions from it, but if you add a `+` at the end of the expression in [this regex101 sample](https://regex101.com/r/mwDnAG/1) execution time will go down from 1 to 0ms and the number of steps will fall from 32 to 14. On an input without any repeats [it only adds one step](https://regex101.com/r/bm80sa/2) – Aaron Apr 12 '17 at 12:24
@Aaron I would refute that conclusion with https://regex101.com/r/9Md35x/1, the change seems marginal and I would attribute it to the javascript implementation potentially and maybe flow prediction – HopefullyHelpful Apr 12 '17 at 12:30
@HopefullyHelpful heh? Testing it on my side, it seems to behave marginally better with `+`, going down from 148305 steps to 139377 and from ~375ms to ~350ms. Thanks for taking the time to make a good data set in any case! You're right that it probably depends on regex engine specifics – Aaron Apr 12 '17 at 12:31
I tested it with a larger sample and it's a 3% increase, but I would expect it to be dependent on the language and the code. JavaScript is a slow scripting language, so the prediction that another match might follow could boost it by a larger percentage than for C or Java. Would be interesting to test, though. – HopefullyHelpful Apr 12 '17 at 12:36
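
On the correctness side of the `+` discussion above, a small check like the following can confirm that the quantifier changes only how many match steps are taken, not the resulting string. It says nothing about timing in Java; the class and method names are hypothetical.

```java
final class PlusQuantifierDemo {
    // The explicit character class from the accepted answer, without and with `+`.
    static final String WITHOUT_PLUS = "[\u00BC-\u00BE\u2150-\u215E\u2189]";
    static final String WITH_PLUS = WITHOUT_PLUS + "+";

    // Removes fraction symbols one at a time (no quantifier).
    static String stripSingly(String input) {
        return input.replaceAll(WITHOUT_PLUS, "");
    }

    // Removes fraction symbols a whole run at a time (`+` quantifier).
    static String stripRuns(String input) {
        return input.replaceAll(WITH_PLUS, "");
    }

    public static void main(String[] args) {
        String input = "\u00BC\u00BD cups, then \u00BE more";
        // Both variants produce the same output for an empty replacement.
        System.out.println(stripSingly(input).equals(stripRuns(input))); // true
    }
}
```

The two variants would diverge only with a non-empty replacement, where a run such as "¼½" is replaced twice without `+` and once with it.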

Sumit Gulati · Answer 2 · 2017-04-12T03:17:30.787

2

You can use the regex below to replace all fractions with an empty string.

str = str.replaceAll("(([\\xbc-\\xbe])?)", "");

edited Apr 12 '17 at 03:17

answered Apr 12 '17 at 02:55

Why the additional capturing groups `()` and the optional `?` match? – MT0 Apr 12 '17 at 08:48
You know, just in case, you wanted to replace "" with "" – HopefullyHelpful Apr 12 '17 at 10:00
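
To illustrate the point of the two comments above, here is a small sketch comparing the regex from this answer with a simplified form that drops the capturing groups and the `?`. The class name, constants, and `apply` helper are assumptions for illustration. Note that both patterns cover only U+00BC to U+00BE, that is, ¼, ½, and ¾.

```java
final class OptionalGroupDemo {
    // The regex as posted in this answer.
    static final String ORIGINAL = "(([\\xbc-\\xbe])?)";
    // The same character class without the groups and the optional quantifier.
    static final String SIMPLIFIED = "[\\xbc-\\xbe]";

    // Thin wrapper around String.replaceAll so both patterns are applied the same way.
    static String apply(String regex, String input, String replacement) {
        return input.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        String input = "\u00BC cups of sugar";
        // With an empty replacement the two patterns give the same result,
        // because replacing an empty match with "" changes nothing.
        System.out.println(apply(ORIGINAL, input, "").equals(apply(SIMPLIFIED, input, ""))); // true
        // With a visible replacement, the optional version also matches the
        // empty string between characters, so the outputs differ.
        System.out.println(apply(ORIGINAL, input, "#"));
        System.out.println(apply(SIMPLIFIED, input, "#"));
    }
}
```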
