Extracting phone numbers from free text

Question

I'm writing a program that scrapes blog posts from a number of websites, and I'm trying to extract Australian-formatted phone numbers from the free text. This has proven to be fairly difficult.

Here are a few constructed blog post examples:

Example 1:

"Hello, my name is Alicia I'm 32 and have lived in Brisbane for the past 40 years. I'm 6" tall and an agile runner. Since 2004 I have been running for 2-3 times per week. Feel free to call +61 (04) 654 456 or try my other number 0434 43 22 34."

From this blog post I need to extract "04654456" and "0434432234".

Example 2:

"I'm Joe and also love running. Standing 7" feet tall and have been going at it since 2004. For training advice pls call 043 572-6087 or (02) 1232 23 56."

From this blog post I need to extract "0435726087" and "0212322356".

Example 3:

"My name is Pricilla and I love running. You can reach me on 0 434 45 45 12, but don't call before 12 am pls (I got clients up until 10-11-ish). My license number is 4335TE33 and I drive a 2004 Ford Bronco with brand new 6" tires. I can run 28 km, but usually require a break every 3 or 4 km. Call me today (04) 3 445 4512"

From this blog post I need to extract "0434454512".

I have come up with quite an elaborate system that for each blog entry does the following:

1) Strip away all non-numeric characters, trim the result and collapse double spaces

2) Convert the string to an array, so that we are left with just an array of number groups, e.g. ['0', '434', '45', '45', '12', '4335', '33', '2004', '6', '28', '3', '4', '04', '34', '832', '234']

3) Iterate through the array of numbers and apply rules to piece them together. This code is bloated and not very pretty.

4) Validate the result using a RegExp pattern for Australian mobile and landline numbers
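
For illustration, steps 1 to 4 could be sketched roughly as follows in Python. This is not the actual code of the system described above: the function name join_digit_groups, the greedy joining rule and the validation pattern (10 digits starting with 02, 03, 04, 07 or 08) are all assumptions made for the sketch.

```python
import re

# Assumed validation pattern: 10 digits, starting with a landline area code
# (02, 03, 07, 08) or the mobile prefix 04.
VALID = re.compile(r"0[23478]\d{8}")


def join_digit_groups(text):
    """Hypothetical sketch: join neighbouring digit groups into phone numbers."""
    # Steps 1 and 2: keep only the runs of digits, as a list of strings.
    groups = re.findall(r"\d+", text)
    found = []
    i = 0
    while i < len(groups):
        # Step 3: glue consecutive groups together until 10 digits are reached.
        candidate = ""
        j = i
        while j < len(groups) and len(candidate) < 10:
            candidate += groups[j]
            j += 1
        # Step 4: accept the candidate only if it validates as a whole.
        if VALID.fullmatch(candidate):
            if candidate not in found:
                found.append(candidate)
            i = j  # continue after the groups that were consumed
        else:
            i += 1  # otherwise retry from the next group
    return found


print(join_digit_groups("Call 0 434 45 45 12, but not before 12 am, or (04) 3 445 4512"))
```

The sketch joins digit groups no matter how far apart they sit in the text, so a real version would probably also need a rule that limits the distance between groups. Numbers that do not fit the assumed 10-digit pattern are ignored.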

Obviously I have tried with regular expressions, but they fail big time in this case.

My system works most of the time, but the code is not pretty to say the least.

How would you attack this?

A.D · Accepted Answer · 2015-08-04T03:42:28.887

What you are looking for is actually a research area in Natural Language Processing known as entity extraction. There are many approaches to the problem and several mathematical models for solving such tasks. Fortunately, there are toolkits available that do similar tasks; OpenNLP and Stanford NER are a couple of examples. They have tools to automatically extract names, dates, parts of speech, etc. You might be able to adapt them to extract phone numbers. One thing to know is that these are statistical models (as opposed to rule-based, which is your current approach), so you would need training data.

Note that this might require significant changes to what you are currently doing so it may or may not be worth it, but if you are going to be working on such problems related to entity extraction from unstructured text it might be worth knowing about these tools.

I would start by looking into OpenNLP/Stanford documentation to see if what you are looking for is possible.

well, this is a programming question, but the NLP you pointed out could be valuable to OP. — Raptor, Aug 04 '15 at 03:40

score 0 · Answer 2 · answered Aug 04 '15 at 03:32

I'd use a simpler approach:

1) Remove spaces, commas, parentheses and any other symbols you can.

2) Use a regex to match runs of X digits, where X is the length of an Australian phone number.
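
The two steps above can be sketched as follows in Python. This is only a minimal sketch: the helper name extract_numbers and the pattern (10 digits starting with 02, 03, 04, 07 or 08, not embedded in a longer run of digits) are assumptions, not something stated in this answer.

```python
import re

# Assumed shape of an Australian number: 10 digits beginning with
# 02, 03, 07, 08 (landline) or 04 (mobile), not part of a longer digit run.
AU_NUMBER = re.compile(r"(?<!\d)0[23478]\d{8}(?!\d)")


def extract_numbers(text):
    """Hypothetical helper: strip symbols, then match 10-digit runs."""
    # Step 1: drop spaces and punctuation, keeping letters so that
    # unrelated numbers stay separated by the words between them.
    squeezed = re.sub(r"[^0-9A-Za-z]", "", text)
    # Step 2: collect the matches, dropping repeats but keeping their order.
    found = []
    for number in AU_NUMBER.findall(squeezed):
        if number not in found:
            found.append(number)
    return found


print(extract_numbers("pls call 043 572-6087 or (02) 1232 23 56."))
```

One limitation of this sketch: two numbers separated only by punctuation (for example a comma-separated list) merge into a single long run of digits after step 1, and that run then matches nothing.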

Ibu

Ibu, that's mostly what I'm doing now. As I said, it works, but it's not an ideal approach. – ChrisRich Aug 04 '15 at 03:53
It's not working for you? Is it failing? This approach will work for all 3 examples you showed. – Ibu Aug 04 '15 at 06:43

score 0 · Answer 3 · answered Aug 04 '15 at 03:56

I would go with a regexp, because you sometimes get wrong numbers if you simply keep all the digits:

+49 (0) 7121 / 1229-276

That should be read locally as 071211229276 or internationally as 004971211229276. Keeping every digit would give 4907121229276, which is neither.
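
As a rough illustration of the idea, the Python sketch below turns a number written with a "+" country code into the two forms given above. The function name normalise, the exit_code parameter (defaulting to the "00" prefix used in this answer) and the rule that a single leading 0 after the country code is a trunk prefix are assumptions; the rule does not hold for every country.

```python
import re


def normalise(raw, exit_code="00"):
    """Hypothetical helper: return (local, international) forms, or None."""
    # A "+", then a 1-3 digit country code that is followed by a non-digit.
    match = re.match(r"\s*\+\s*(\d{1,3})(?!\d)(.*)", raw)
    if match is None:
        return None  # not written in international "+" notation
    country, rest = match.groups()
    national = re.sub(r"\D", "", rest)  # keep only the digits after the code
    # Assumption: one leading 0, as in "(0)" or "(04)", is a trunk prefix
    # that is dialled locally but dropped after the country code.
    if national.startswith("0"):
        national = national[1:]
    return "0" + national, exit_code + country + national


print(normalise("+49 (0) 7121 / 1229-276"))
```

The sketch relies on a space or other separator after the country code; a number written as one unbroken run of digits after the "+" returns None, because the length of the country code cannot be guessed from the digits alone.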

Alinex

Show me a RegExp that can cope with my different examples and the endless, inconceivable ways users write phone numbers, because I have not been able to either find or build one myself! Currently in my code, I simply disregard international dialling codes. – ChrisRich Aug 04 '15 at 04:00
