How can I extract phone numbers from a text file?

x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
  "43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567", 
  "Please contact Mr. Bean (613)2134567",
  "1.575.555.5555 is his #1 number",  
  "7164347566"
)

This question has been answered for other languages (see PHP and general regex) but doesn't seem to have been tackled on SO for R.

I have searched and found what appear to be candidate regexes for phone numbers (in addition to the regexes from the other languages above): http://regexlib.com/Search.aspx?k=phone but I have not been able to use them with gsub in R to extract all of the numbers in the example.

Ideally, we'd get something like:

[[1]]
[1] "2-613-213-4567" "5555555555"    

[[2]]
[1] "613 213 4567"

[[3]]
[1] "(613)2134567"

[[4]]
[1] "1.575.555.5555"

[[5]]
[1] "7164347566"
Tyler Rinker

3 Answers

This is the best I've been able to do. You have a pretty wide range of formats, including some with spaces, so the regex is pretty general. It just says "look for a string of at least 5 characters made up entirely of digits, periods, brackets, hyphens or spaces, bounded by a space or the start/end of the string":

library(stringr)
str_extract_all(x, "(^| )[0-9.() -]{5,}( |$)")

Output:

[[1]]
[1] " 2-613-213-4567 " " 5555555555 "    

[[2]]
[1] " 613 213 4567"

[[3]]
[1] " (613)2134567"

[[4]]
[1] "1.575.555.5555 "

[[5]]
[1] "7164347566"

The leading/trailing spaces could probably be fixed with some additional complexity, or you could just fix it in post.
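
One way to "fix it in post" is to trim each match after extraction. The sketch below uses base R (regmatches, gregexpr, trimws) in place of stringr so it runs without extra packages; the shortened vector x_demo and the names raw and trimmed are made up for illustration.

```r
# Shortened, ASCII-only version of the example vector (hypothetical name).
x_demo <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
  "Please contact Mr. Bean (613)2134567",
  "1.575.555.5555 is his #1 number")

# Same pattern as above: a run of 5+ digits, periods, brackets, hyphens or
# spaces, bounded by a space or the start/end of the string.
pattern <- "(^| )[0-9.() -]{5,}( |$)"

# Extract every match per element; the matches still carry the bounding spaces.
raw <- regmatches(x_demo, gregexpr(pattern, x_demo, perl = TRUE))

# Strip leading/trailing whitespace from each match, keeping the list structure.
trimmed <- lapply(raw, trimws)
trimmed
```

With stringr loaded, lapply(str_extract_all(x, pattern), str_trim) should do the same job.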

Update: a bit of searching led me to this answer, which I slightly modified to allow periods. It is a bit stricter in that it requires a valid (US?) phone number, and it drops the leading "2-" and "1." prefixes, but it otherwise seems to cover all your examples:

str_extract_all(x, "\\(?\\d{3}\\)?[.-]? *\\d{3}[.-]? *[.-]?\\d{4}")

Output:

[[1]]
[1] "613-213-4567" "5555555555"  

[[2]]
[1] "613 213 4567"

[[3]]
[1] "(613)2134567"

[[4]]
[1] "575.555.5555"

[[5]]
[1] "7164347566"

The monstrosity found here also works once you take out the ^ and $ at either end. Use only if you really need it:

huge_regex = "(?:(?:\\+?1\\s*(?:[.-]\\s*)?)?(?:\\(\\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\\s*\\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\\s*(?:[.-]\\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\\s*(?:[.-]\\s*)?([0-9]{4})(?:\\s*(?:#|x\\.?|ext\\.?|extension)\\s*(\\d+))?"
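
For completeness, this is roughly how that pattern could be applied. The sketch uses base R with perl = TRUE (an assumption on my part, since the pattern relies on (?:...), \\s and \\d); with stringr, str_extract_all(x, huge_regex) would be the equivalent call. x_demo and matches are made-up names, and only two of the example strings are used.

```r
# The pattern from above, with the ^ and $ anchors already removed.
huge_regex <- "(?:(?:\\+?1\\s*(?:[.-]\\s*)?)?(?:\\(\\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\\s*\\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\\s*(?:[.-]\\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\\s*(?:[.-]\\s*)?([0-9]{4})(?:\\s*(?:#|x\\.?|ext\\.?|extension)\\s*(\\d+))?"

# Two of the example strings (hypothetical vector name).
x_demo <- c("Please contact Mr. Bean (613)2134567", "7164347566")

# gregexpr() finds all matches per element; regmatches() pulls out the text.
matches <- regmatches(x_demo, gregexpr(huge_regex, x_demo, perl = TRUE))
matches
```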
Marius

The qdapRegex package now has rm_phone, and its extraction counterpart ex_phone, designed specifically for this task:

x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
  "43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567", 
  "Please contact Mr. Bean (613)2134567",
  "1.575.555.5555 is his #1 number",  
  "7164347566"
)

library(qdapRegex)
ex_phone(x)

## [[1]]
## [1] "613-213-4567" "5555555555"  
## 
## [[2]]
## [1] "613 213 4567"
## 
## [[3]]
## [1] "(613)2134567"
## 
## [[4]]
## [1] "1.575.555.5555"
## 
## [[5]]
## [1] "7164347566"
Tyler Rinker

You would need a complex regex to cover all the rules for matching phone numbers, but this one covers your examples:

> library(stringi)
> unlist(stri_extract_all_regex(x, '(\\d[.-])?\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}\\b'))
# [1] "2-613-213-4567" "5555555555"     "613 213 4567"   "(613)2134567"  
# [5] "1.575.555.5555" "7164347566" 
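
Note that unlist() flattens the result, so you lose track of which element each number came from. If the per-element list from the question is wanted, dropping unlist() should be enough. A minimal base R sketch of the same idea (swapping regmatches/gregexpr in for stringi; x_demo and per_element are illustrative names):

```r
# Shortened, ASCII-only version of the example vector (hypothetical name).
x_demo <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
  "Please contact Mr. Bean (613)2134567",
  "1.575.555.5555 is his #1 number")

# Same pattern as above: optional one-digit prefix, optional brackets around
# the area code, optional separators, and a word boundary at the end.
pattern <- "(\\d[.-])?\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}\\b"

# One character vector of matches per input element, no flattening.
per_element <- regmatches(x_demo, gregexpr(pattern, x_demo, perl = TRUE))
per_element
```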
hwnd