I am attempting to parse a data frame that has text in each row, and within that text there are IP addresses I want to isolate. However, I am still picking up integers, whole numbers and periods. Below is an example of what I am working with.

    z <- data.frame( x =  c('112.68.196.98   5.32', '192.41.196.888', '..','5.32  88'))
    gsub("^\\.+|\\.[^.]*$", "", z$x, perl=TRUE)

I am looking to clean this data frame so that the output would be just:

    z <- data.frame( x =  c('112.68.196.98', '192.41.196.888','',''))

I can't seem to come up with the proper regex to put into the gsub. Thanks.

Justin
  • Possible duplicate of [regex ip address from string](http://stackoverflow.com/questions/8439633/regex-ip-address-from-string) – zero323 Jan 07 '16 at 23:06
  • R uses slightly different syntax. Not a duplicate, just very similar. – Justin Jan 08 '16 at 15:09

1 Answer

I think this should work:

    re <- regexpr(
      "(?(?=.*?(\\d+\\.\\d+\\.\\d+\\.\\d+).*?)(\\1|))",
      z$x, perl = TRUE)

    regmatches(z$x, re)
    #[1] "112.68.196.98"  "192.41.196.888" ""               ""

This uses a regex conditional, keeping the capture group (\\1) in the case of a positive match on .*?(\\d+\\.\\d+\\.\\d+\\.\\d+).*?, else returning an empty result.
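
For comparison, here is a minimal sketch of a different route to the same result that avoids the conditional: match a plain IP-like pattern with regexpr and fill in the empty strings by hand. The helper name extract_first_ip is hypothetical (it is not part of the answer above), and the pattern only checks for four dot-separated digit runs, so a value like 192.41.196.888 is kept, as the question asks.

```r
# Hypothetical helper, not part of the original answer: return the first
# IP-like token in each string, or "" when a string has none.
extract_first_ip <- function(x) {
  x <- as.character(x)                      # tolerate factor columns
  pattern <- "\\d+\\.\\d+\\.\\d+\\.\\d+"    # four dot-separated digit runs
  m <- regexpr(pattern, x, perl = TRUE)     # -1 marks "no match"
  out <- character(length(x))               # pre-filled with ""
  out[m != -1] <- regmatches(x, m)          # regmatches() drops non-matches
  out
}

z <- data.frame(
  x = c('112.68.196.98   5.32', '192.41.196.888', '..', '5.32  88'),
  stringsAsFactors = FALSE
)
first_ips <- extract_first_ip(z$x)
print(first_ips)
#[1] "112.68.196.98"  "192.41.196.888" ""               ""
```

Because the result has one element per input row, it can be assigned straight back to the data frame.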


Update:

Regarding your comment, I think the following changes will allow you to capture multiple IP addresses in a single string. First, switch from regexpr to gregexpr to allow multiple results:

    re2 <- gregexpr(
      "(?(?=.*?(\\d+\\.\\d+\\.\\d+\\.\\d+).*?)(\\1|))",
      z2$x, perl = TRUE
    )

Since calling regmatches on a gregexpr input will return a list, some additional processing is required:

    res2 <- sapply(regmatches(z2$x, re2), function(x) {
      # join the matches for one row, then collapse and trim extra whitespace
      gsub(
        "^\\s+|\\s+$", "",
        gsub("\\s+", " ", paste0(x, collapse = " "))
      )
    })

This should be suitable for, e.g., recombining with your data.frame as a new column:

    res2
    #[1] "112.68.196.98 192.41.196.888" "192.41.196.888"
    #     ""                             "112.68.196.98"
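
As a rough alternative sketch, the same combined-per-row result can be produced with a plain pattern, which yields no empty matches and so needs no whitespace cleanup. The helper name extract_all_ips and the new column name ips are assumptions made for illustration; neither appears in the code above.

```r
# Hypothetical helper, not part of the original answer: collect every
# IP-like token in each string and join them with single spaces.
extract_all_ips <- function(x) {
  x <- as.character(x)                      # tolerate factor columns
  pattern <- "\\d+\\.\\d+\\.\\d+\\.\\d+"    # four dot-separated digit runs
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  # paste() on a zero-length vector with collapse gives "", so rows
  # without any match keep their position in the result
  vapply(hits, paste, character(1), collapse = " ")
}

z2 <- data.frame(
  x = c('112.68.196.98 5.32 192.41.196.888',
        '192.41.196.888',
        '..', '5.32 88 112.68.196.98'),
  stringsAsFactors = FALSE
)
z2$ips <- extract_all_ips(z2$x)   # new column, aligned with the original rows
print(z2$ips)
#[1] "112.68.196.98 192.41.196.888" "192.41.196.888"
#[3] ""                             "112.68.196.98"
```

vapply is used rather than sapply so that the result is guaranteed to be a character vector with one element per row.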

And if you did want to break out each result into its own string, the expression is a little simpler (compared to sapply(...)):

    lapply(regmatches(z2$x, re2), function(x) {
      Filter(function(y) y != "", x)
    })
    #[[1]]
    #[1] "112.68.196.98"  "192.41.196.888"

    #[[2]]
    #[1] "192.41.196.888"

    #[[3]]
    #character(0)

    #[[4]]
    #[1] "112.68.196.98"
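
If the addresses are first stored combined in one space-separated string per row and broken out later, a sketch along these lines should recover the same list while keeping each row's index. The vector combined is made-up sample input that mirrors the combined result shown earlier.

```r
# Assumed input: one space-separated string of IPs per row ("" for none)
combined <- c("112.68.196.98 192.41.196.888", "192.41.196.888",
              "", "112.68.196.98")

# Split each string on single spaces; an empty string yields character(0),
# so the list keeps one element per original row
parts <- strsplit(combined, " ", fixed = TRUE)

# Number of IPs found in each row
ip_counts <- lengths(parts)
print(ip_counts)
#[1] 2 1 0 1
```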

Data:

    z2 <- data.frame(
      x = c('112.68.196.98 5.32 192.41.196.888',
            '192.41.196.888',
            '..', '5.32 88 112.68.196.98'),
      stringsAsFactors = FALSE
    )
nrussell
  • Thanks nrussell. This works well. One more question, if I had multiple IPs in the same cell, how would I modify this to capture 1 or more IPs. For example: z <- data.frame( x = c('112.68.196.98 5.32 192.41.196.888', '192.41.196.888', '..', '5.32 88 112.68.196.98')) – Justin Jan 08 '16 at 15:07
  • Should multiple ips be split into their own string, or remain combined (e.g. separated by a space)? – nrussell Jan 08 '16 at 15:09
  • For the purposes of my project, I am keeping them combined into one cell/element (separated by a space) and I will break them out later. I want them to keep their associated index. Thanks. – Justin Jan 08 '16 at 15:49