I am having trouble using regular expressions to extract a longitude and latitude from a string. The string is this:

[1] "\"42.352800\" data-longitude=\"-71.187500\" \"22\"></div>"

I want to be able to get both the first number "42.352800" and the second number "-71.187500" separately as two variables. Because I'll be doing this on a bunch of entries, I need to make sure that it can get these numbers whether they are positive or negative.

I figured I should be using a regular expression to say basically:

latitude <- from " to " (to get the first number)

and then something similar to get the longitude.

Any ideas here? I am relatively new to regex.

r2evans
Andrew Colin
  • It looks as if you are scraping HTML (based on the `</div>`). It might be better to look at the data source itself instead of regex, in case it is more parseable. (Regex should not always be your first attempt at accessing data.) – r2evans Jun 08 '20 at 01:09

1 Answer

I agree with @r2evans that if you are scraping this information from a webpage, it would be much simpler to get the data with a package such as rvest.
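As a rough sketch of what that could look like: the HTML fragment below is made up, since the real page markup is not shown in the question. The `data-latitude` and `data-accuracy` attribute names and the CSS selector are assumptions; only `data-longitude` appears in the string. It also assumes rvest 1.0 or later, where `html_element()` is available.

```r
library(rvest)

# Hypothetical fragment; the actual markup of the page is unknown
html <- '<div class="map" data-latitude="42.352800" data-longitude="-71.187500" data-accuracy="22"></div>'

page <- read_html(html)

# Select the first <div> that carries a data-latitude attribute (assumed name)
node <- html_element(page, "div[data-latitude]")

# html_attr() returns the attribute value as a character string
latitude  <- as.numeric(html_attr(node, "data-latitude"))
longitude <- as.numeric(html_attr(node, "data-longitude"))
```

If the coordinates really are stored as attributes like this, reading them with `html_attr()` avoids depending on the exact order and spacing of the raw text.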

To answer your question, you can use str_match to get the first two numbers.

string <- "\"42.352800\" data-longitude=\"-71.187500\" \"22\"></div>"

stringr::str_match(string, '(\\d+\\.\\d+).*?(-?\\d+\\.\\d+)')[, -1]
#[1] "42.352800"  "-71.187500"
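
Since you will be running this on many entries, here is a minimal sketch of the same idea applied to a vector of strings, with a pattern that allows a minus sign on either number and with the pieces of the regex commented. The second string is a made-up entry for illustration, and the variable names are my own.

```r
library(stringr)

strings <- c(
  "\"42.352800\" data-longitude=\"-71.187500\" \"22\"></div>",
  "\"-33.868800\" data-longitude=\"151.209300\" \"22\"></div>"  # made-up entry
)

# (-?\\d+\\.\\d+)  group 1: optional minus, digits, a literal dot, digits
# .*?              as few characters as possible between the two numbers
# (-?\\d+\\.\\d+)  group 2: same shape as group 1
pattern <- "(-?\\d+\\.\\d+).*?(-?\\d+\\.\\d+)"

# str_match() returns a matrix: column 1 is the full match,
# columns 2 and 3 are the two capture groups
m <- str_match(strings, pattern)

latitude  <- as.numeric(m[, 2])
longitude <- as.numeric(m[, 3])
```

Here `latitude` and `longitude` are numeric vectors with one value per input string. The trailing `"22"` is not picked up because the pattern requires a decimal point.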
Ronak Shah
  • Thank you very much for this answer. Works like a charm. Could you explain to me a little more as to what is going on with this expression? Would this still work if the first number were negative? Also, good point on rvest. The reason I am not using it is because the location data is not something you can grab using rvest. It's hidden underneath other things in the web page and I was unable to extract it using rvest. – Andrew Colin Jun 08 '20 at 02:05
  • No, this will not work if the first number is negative. In that case you need to use the same pattern for the first number as for the second: `(-?\\d+\\.\\d+).*?(-?\\d+\\.\\d+)`, where `-?` means an optional negative sign that may or may not occur. – Ronak Shah Jun 08 '20 at 02:20