I'm trying to identify numbers and their corresponding magnitudes in a piece of text, and I run into the following error:
UNABLE TO PARSE MAGNITUDE: 6,700
Here's a snippet from a larger program to show what I'm doing:
for (Quantity quantity : originalQuantities) {
    y = Math.round(quantity.getMagnitude());
    if (roleStrings.get(SemanticRole.TIME) != null
            && roleStrings.get(SemanticRole.TIME).contains(String.valueOf(y)))
        continue;
    .........................
Quantity here is a class with the following definition:
public class Quantity
{
    private Float magnitude;
    private String multiplier;
    private String unit;
    private UnitType type;
    private Float absoluteMagnitude;
    enum UnitType
    {
        TIME, MONEY, WEIGHT, VOLUME, NUMBER
    }
    public Quantity(String strMagnitude, String multiplier, String unit,
            String strType)
    {
        this.setMagnitude(strMagnitude);
        this.multiplier = multiplier;
        this.unit = unit;
        this.setType(strType);
    }
    public Float getMagnitude()
    {
        return magnitude;
    }
    public String getMultiplier()
    {
        return multiplier;
    }
    public String getUnit()
    {
        return unit;
    }
    public UnitType getType()
    {
        return type;
    }
How do I solve this? I tried using Locale, parseFloat, and other transformations, but couldn't fix the issue.
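To narrow this down, here is a minimal sketch of how a grouped number such as 6,700 behaves with the two parsing approaches mentioned above. The class and method names (GroupedNumberParseDemo, parseFloatAccepts, parseUs) are made up for illustration and are not part of my project; only Float.parseFloat and NumberFormat.getNumberInstance(Locale.US) are real JDK calls.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical demo class, not part of the original code base.
final class GroupedNumberParseDemo {

    // Returns true if Float.parseFloat accepts the string as-is.
    static boolean parseFloatAccepts(String s) {
        try {
            Float.parseFloat(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Parses with the US number format, which treats ',' as a grouping separator.
    static float parseUs(String s) {
        try {
            return NumberFormat.getNumberInstance(Locale.US).parse(s).floatValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable number: " + s, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseFloatAccepts("6,700")); // false
        System.out.println(parseFloatAccepts("6700"));  // true
        System.out.println(parseUs("6,700"));           // 6700.0
    }
}
```

Float.parseFloat rejects the grouping comma, while the US NumberFormat accepts it, so the exact string that reaches the parser matters.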
Here is the code that parses the magnitude:
public static List<Quantity> getQuantitiesFromString(String str) throws ParseException
{
    List<Quantity> quantities = new ArrayList<Quantity>();
    //final String REGEX = "^(\\+|-)?([1-9]\\d{0,2}|0)?(,\\d{3}){0,}(\\.\\d+)?";
    //NumberFormat numberFormat = NumberFormat.getNumberInstance(Locale.US);
    //String numberAsString = numberFormat.format(number);
    // optional +/- sign followed by numbers separated with a decimal
    Pattern pattern = Pattern.compile("^[-+]?[0-9]*\\.?[0-9]+");
    Pattern pattern1 = Pattern.compile("^[0-9][0-9,-]*-[0-9,-]*[0-9]");
    List<String> tokens = Arrays.asList(str.split(" "));
    for (int i = 0; i < tokens.size(); i++)
    {
        String magnitude = "";
        String multiplier = "";
        String unit = "";
        String type = "";
        boolean numFound = false;
        String token = tokens.get(i);
        // append all numbers matching pattern into a String
        Matcher matcher = pattern.matcher(token);
        Matcher matcher1 = pattern1.matcher(token);
        while (matcher.find())
        {
            numFound = true;
            magnitude += matcher.group();
        }
        //ignore for number ranges (e.g. 0-10)
        while (matcher1.find())
        {
            numFound = false;
            continue;
        }
        if (numFound)
        {
            // loop through all words starting from current word
            // keep adding valid unit words until an invalid unit word is
            // encountered
            for (int j = i; j < tokens.size(); j++)
            {
                // strip non-alphabetic chars from word
                String word = tokens.get(j).replaceAll("[^a-zA-Z$%]", "")
                        .toLowerCase();
                // see if the stripped word is a unit
                boolean validUnitWord = false;
                if (getUnitTypesMap().keySet().contains(word))
                {
                    validUnitWord = true;
                    if (getUnitTypesMap().get(word).equalsIgnoreCase(
                            "number"))
                    {
                        multiplier += multiplier.isEmpty() ? word : " "
                                + word;
                    }
                    else
                    {
                        unit += unit.isEmpty() ? word : " " + word;
                        type = getUnitTypesMap().get(word);
                    }
                }
                // break if invalid unit word; else keep searching in next
                // words
                // except for current word (index = i), in which case keep
                // searching regardless
                if (!validUnitWord && j != i)
                    break;
            }
            quantities.add(new Quantity(magnitude, multiplier, unit, type));
        }
    }
    return quantities;
}
EDIT
The "UNABLE TO PARSE MAGNITUDE" error occurred while I was experimenting with Locale.US.
I have reverted to the older code, and now for a string like:
debentures amounting to Rs 6,700 crore
the output I get from getQuantitiesFromString is:
QUANTITY: [[magnitude=6.0, multiplier=crore, unit=, type=NUMBER, absoluteMagnitude=null]]
Everything after the comma is ignored. I tried this regex to detect numbers like 22,00.15, 22,000,353, etc.:
"^(\+|-)?([1-9]\d{0,2}|0)?(,\d{3}){0,}(\.\d+)?"
But for some reason it doesn't work in my code.
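For reference, here is a minimal sketch that reproduces the truncation outside the larger program and compares it with two comma-aware patterns. MagnitudeRegexDemo, firstMatch, finds and toFloat are hypothetical names for this example. WITH_COMMAS is just one possible alternative, and it is deliberately loose: it does not check that commas fall at valid grouping positions and does not accept a leading decimal point such as .5. TRIED is the regex quoted above, with the backslashes doubled as Java string literals require.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original code base.
final class MagnitudeRegexDemo {
    // Pattern currently used in getQuantitiesFromString: no comma allowed.
    static final Pattern CURRENT = Pattern.compile("^[-+]?[0-9]*\\.?[0-9]+");
    // Assumed alternative: a digit, then digits or commas, then an optional decimal part.
    static final Pattern WITH_COMMAS = Pattern.compile("^[-+]?[0-9][0-9,]*(\\.[0-9]+)?");
    // The regex from the question, escaped for a Java string literal.
    static final Pattern TRIED =
            Pattern.compile("^(\\+|-)?([1-9]\\d{0,2}|0)?(,\\d{3}){0,}(\\.\\d+)?");

    // Returns the first match of the pattern in the token, or "" if there is none.
    static String firstMatch(Pattern p, String token) {
        Matcher m = p.matcher(token);
        return m.find() ? m.group() : "";
    }

    // Reports whether find() succeeds at all, even with an empty match.
    static boolean finds(Pattern p, String token) {
        return p.matcher(token).find();
    }

    // Strips grouping commas before handing the text to Float.parseFloat.
    static float toFloat(String matched) {
        return Float.parseFloat(matched.replace(",", ""));
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(CURRENT, "6,700"));     // 6
        System.out.println(firstMatch(WITH_COMMAS, "6,700")); // 6,700
        System.out.println(toFloat("6,700"));                 // 6700.0
        System.out.println(firstMatch(TRIED, "6,700"));       // 6,700
        System.out.println(finds(TRIED, "crore"));            // true (empty match)
    }
}
```

In isolation the current pattern stops at the comma, which is consistent with the magnitude=6.0 output. Every group in the TRIED pattern is optional, so find() also succeeds with an empty match on a token like crore; inside a loop such as while (matcher.find()), that would set numFound for non-numeric tokens too.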