I'm writing a program that scrapes blog posts from a number of websites. I'm trying to extract Australian-formatted phone numbers from the free text of each post. This has proven to be fairly difficult.
Here are a few constructed blog post examples:
Example 1:
"Hello, my name is Alicia I'm 32 and have lived in Brisbane for the past 40 years. I'm 6" tall and an agile runner. Since 2004 I have been running for 2-3 times per week. Feel free to call +61 (04) 654 456 or try my other number 0434 43 22 34."
From this blog post I need to extract "04654456" and "0434432234".
Example 2:
"I'm Joe and also love running. Standing 7" feet tall and have been going at it since 2004. For training advice pls call 043 572-6087 or (02) 1232 23 56."
From this blog post I need to extract "0435726087" and "0212322356".
Example 3:
"My name is Pricilla and I love running. You can reach me on 0 434 45 45 12, but don't call before 12 am pls (I got clients up until 10-11-ish). My license number is 4335TE33 and I drive a 2004 Ford Bronco with brand new 6" tires. I can run 28 km, but usually require a break every 3 or 4 km. Call me today (04) 3 445 4512"
From this blog post I need to extract "0434454512".
I have come up with quite an elaborate system that does the following for each blog entry:
1) Strips away all non-numeric characters, trims the result and removes double spaces
2) Converts the string to an array. So now we just have an array of numbers, e.g. ['0', '434', '45', '45', '12', '4335', '33', '2004', '6', '28', '3', '4', '04', '34', '832', '234']
3) Iterates through the array of numbers and applies rules to piece the phone numbers together. This code is bloated and not very pretty.
4) Validates the result using a RegExp pattern for Australian mobile and landline numbers
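To make the four steps concrete, here is a stripped-down sketch in Python. It is not my actual code: the function names are made up for this post, step 3 is reduced to a single greedy rule (start at a token beginning with 0 and keep appending tokens until there are 10 digits), and the validation pattern is a simplified assumption that only accepts 10-digit numbers starting with 02, 03, 04, 07 or 08. That means it would not pick up the 8-digit number from Example 1.

```python
import re

# Step 4 (simplified assumption): 10-digit national numbers only,
# landline prefixes 02/03/07/08 or mobile prefix 04.
AU_NUMBER = re.compile(r"0[23478]\d{8}")


def tokenize_digits(text: str) -> list[str]:
    """Steps 1-2: turn every non-digit into a space, then split into digit groups."""
    return re.sub(r"\D", " ", text).split()


def is_valid_au_number(candidate: str) -> bool:
    """Step 4: check a joined candidate against the simplified pattern."""
    return AU_NUMBER.fullmatch(candidate) is not None


def extract_numbers(text: str) -> list[str]:
    """Step 3 (greedy sketch): join consecutive digit groups into 10-digit candidates."""
    tokens = tokenize_digits(text)
    found: list[str] = []
    i = 0
    while i < len(tokens):
        candidate = ""
        j = i
        # Only start a candidate at a group that begins with the leading 0.
        if tokens[i].startswith("0"):
            while j < len(tokens) and len(candidate) < 10:
                candidate += tokens[j]
                j += 1
        if is_valid_au_number(candidate):
            if candidate not in found:  # the same number may appear twice in a post
                found.append(candidate)
            i = j  # skip the groups that were consumed
        else:
            i += 1
    return found


if __name__ == "__main__":
    print(extract_numbers("pls call 043 572-6087 or (02) 1232 23 56."))
```

The real rule set has to deal with many more cases than this (shorter numbers, the +61 prefix, stray digits in between), which is where the bloat comes from.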
Obviously I have tried using regular expressions alone, but they fail badly in this case.
My system works most of the time, but the code is not pretty to say the least.
How would you attack this?