
I am trying to parse a string that begins with an HTTP URL. For example, the string looks like this:

str = "http://www.abc.com?id=123&key=456 and more text here"

I want to find where the URL ends. Basically, I use

string.find(str, "......")

What pattern can I put there so that it detects the end of the URL?

Yu Hao
Joe Huang
  • What would be the answer you seek in the example you gave? – lhf Aug 19 '13 at 10:01
  • The precise syntax is given at http://www.w3.org/Addressing/URL/5_URI_BNF.html. – lhf Aug 19 '13 at 10:07
  • I use `string.find(str, "[^$-_+!*',%a%d:/.]")`. Is that correct? It seems odd that it still treats ")" as part of the URL. For example, with `str="http://abc)de.com"` the ")" is considered part of the URL (I was expecting `http://abc` to be parsed out). – Joe Huang Aug 19 '13 at 12:07
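A likely explanation for the ")" behaviour in the last comment: inside a Lua character set, an unescaped `-` between two characters forms a range, so `$-_` covers every character from "$" to "_", and ")" falls inside that range. The sketch below compares the pattern from the comment with a variant that escapes the hyphen as `%-`; the variable names `loose` and `strict` are made up for illustration.

```lua
-- Unescaped: "$-_" inside the set is read as a range from "$" (0x24) to "_" (0x5F),
-- which covers ")" (0x29), so ")" counts as an allowed character.
local loose = "[^$-_+!*',%a%d:/.]"
-- Escaped: "%-" is a literal hyphen, so only "$", "-" and "_" are added to the set.
local strict = "[^$%-_+!*',%a%d:/.]"

local str = "http://abc)de.com"

print(string.find(str, loose))   -- no character is rejected by the loose set
print(string.find(str, strict))  -- position of ")"

-- The text before the first rejected character is the candidate URL.
local stop = string.find(str, strict)
local url = stop and str:sub(1, stop - 1) or str
print(url)
```

With the escaped hyphen, the search stops at ")" and the candidate URL is `http://abc`, which is what the comment expected.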

2 Answers


A simple pattern to match URLs would be:
pattern = "https?://[%w%-_%.%?:/%+=&]+"
string.find(str, pattern)
This is only a starting point and needs improvement to work in all cases. Questions about finding URLs in a string in other languages are a good source of hints (for example, "Regular expression to find URLs within a string"). http://www.lua.org/pil/20.2.html can also be useful.

Also note that parentheses are allowed in a URL, for example: http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx.
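As a rough illustration of how this pattern behaves, the sketch below runs it against the string from the question and against the MSDN URL. It uses a variant of the pattern with the hyphen escaped as `%-` and the repeated `%.` dropped; `with_parens` is a hypothetical extension that also accepts parentheses, not part of the original answer.

```lua
-- Variant of the answer's pattern with the hyphen escaped as "%-".
local basic = "https?://[%w%-_%.%?:/%+=&]+"
-- Hypothetical extension that also accepts parentheses.
local with_parens = "https?://[%w%-_%.%?:/%+=&%(%)]+"

local str = "http://www.abc.com?id=123&key=456 and more text here"
local first, last = string.find(str, basic)
print(first, last)              -- start and end positions of the match
print(str:sub(first, last))     -- the URL itself
print(str:sub(last + 1))        -- the text that follows it

local msdn = "http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx"
print(msdn:match(basic))        -- stops before "("
print(msdn:match(with_parens))  -- keeps the parenthesised part
```

Widening the set makes the pattern accept more URLs, but it also makes it more likely to swallow punctuation that belongs to the surrounding text.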

Ola M

I want to know where the http url link ends

It ends at the space, so just find everything that's not a space:

str:find('%S+')

FYI, if you're just trying to extract that portion of the string, you should use match instead:

str:match('%S+')
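To make the difference between the two calls concrete, here is a minimal sketch using the string from the question. The two-capture pattern at the end is an addition for illustration, assuming the goal is to separate the URL from the text after it.

```lua
local str = "http://www.abc.com?id=123&key=456 and more text here"

-- find returns the start and end positions of the first run of non-space characters.
local first, last = str:find('%S+')

-- match returns the matched text itself.
local url = str:match('%S+')

-- Two captures split the string into the leading token and whatever follows
-- the whitespace after it (an empty string when nothing follows).
local head, rest = str:match('^(%S+)%s*(.*)$')

print(first, last)
print(url)
print(rest)
```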

EDIT: adding clarification per the discussion below.

Note that we are not trying to parse URLs here. We're parsing tokens in a space-delimited string.

We have to assume that the URL contains no unencoded spaces, because otherwise the URL could be any of the following and we have no way of distinguishing between them:

http://www.abc.com?id=123&key=456
http://www.abc.com?id=123&key=456 and
http://www.abc.com?id=123&key=456 and more
http://www.abc.com?id=123&key=456 and more text
http://www.abc.com?id=123&key=456 and more text here

Again, the URL exists in a sentence where words are delimited by spaces, so we have to assume/require that the URL contains no unencoded spaces, which makes finding its end easy.

Mud
  • Someone edited my question, so my string was cut off. I have now edited it back. Please note the "str". There is some text after the URL, so what I am asking is where the URL ends, so that I can separate the URL from the text after it. – Joe Huang Aug 20 '13 at 00:36
  • I use this now: string.find(str, "[^%%%@%;%=%&%?%$%-%_%+%!%*%'%,%a%d%:%/%.]"). What do you think? – Joe Huang Aug 20 '13 at 00:42
  • Just grab everything before the space: `str:find('%S+')`. *(I'm going to update my post to reflect this new information)* – Mud Aug 20 '13 at 03:15
  • Besides the space, are there other characters to consider as the end of a URL, or is the space the only possible ending character? – Joe Huang Aug 20 '13 at 05:59
  • @Mud - A URL is not just any string without a space... it has a quite well-defined syntax. – Ola M Aug 20 '13 at 08:20
  • @OlaM Of course not, but we're not parsing URLs, we're parsing a space-delimited string where the first token happens to be a URL (he's asking how to find its *end*, not its beginning). If the URL contains unencoded spaces then he's *screwed* because there is no possible way to know if he's at a space in a URL or at the beginning of another word. So we necessarily assume that the URL contains no unencoded spaces, as in his example, at which point finding its end is trivial. If he needed to find the beginning as well, we'd make the same necessary assumption and use something like `https?:%S+`. – Mud Aug 20 '13 at 16:15
  • @JoeHuang The space is considered an "unsafe" character in a URL (RFC 1738), but that's moot here. We *have* to assume your URL contains no spaces, because it's contained in a sentence where words are separated by spaces. Look at your example: if spaces were allowed in the URL, how would we know if the URL is **`http://www.abc.com?id=123&key=456`** or **`http://www.abc.com?id=123&key=456 and`** or **`http://www.abc.com?id=123&key=456 and more`** We *have* to assume/require that the URL contain no unencoded spaces, which makes finding its end easy: just look for the first space. – Mud Aug 20 '13 at 16:23
  • @Mud I agree with your solution if we do know that the string starts with a URL. My interpretation was that we are looking for a URL that can be anywhere in the string. JoeHuang - could you clarify? – Ola M Aug 20 '13 at 20:26
  • I already covered that in the comment you're replying to (e.g. `https?://%S+`). If it turns out that he needs to parse the URL out of the middle of a sentence where it might be in contact with punctuation, then his problem gets a lot harder, because some common punctuation is valid in a URL. I'm just presenting the simplest solution to the problem as presented, rather than a solution to a hypothetical problem that neither of our answers solves. – Mud Aug 20 '13 at 20:46
  • @Mud The string always starts with a URL. The URL is never in the middle. Therefore, I am thinking of finding the next character that does not belong to the set of URL characters based on the RFC. I ended up using this: `string.find(str, "[^%%%@%;%=%&%?%$%-%_%+%!%*%'%,%a%d%:%/%.]")` – Joe Huang Aug 22 '13 at 01:41
  • A space cannot exist in a URL string, so `str:find('%S+')` works, but there might be other cases where the URL ends without a space, for example `http://123.com/1.html,its a foreign character` – Joe Huang Aug 22 '13 at 01:42
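The negated-set approach from the last two comments can be sketched as follows. `split_url` is a hypothetical helper name, and the example assumes Lua's default C locale, where bytes above 127 do not match `%a`. It also assumes the "foreign character" in the comment is something like a full-width comma, which is written here as its UTF-8 bytes.

```lua
-- Negated set from the comment above: matches the first byte that is not
-- one of the listed URL characters.
local not_url_char = "[^%%%@%;%=%&%?%$%-%_%+%!%*%'%,%a%d%:%/%.]"

-- split_url is a hypothetical helper: it returns the leading URL and the rest.
local function split_url(s)
  local stop = s:find(not_url_char)
  if not stop then
    return s, ""          -- every character is allowed: the whole string is the URL
  end
  return s:sub(1, stop - 1), s:sub(stop)
end

print(split_url("http://www.abc.com?id=123&key=456 and more text here"))
-- "\239\188\140" is the UTF-8 encoding of a full-width comma (U+FF0C).
print(split_url("http://123.com/1.html\239\188\140its a foreign character"))
-- An ASCII comma is in the allowed set, so the URL runs on until the space.
print(split_url("http://123.com/1.html,its a foreign character"))
```

The last call shows the limit of this approach: when the character that follows the URL is itself valid in a URL, no character set can tell where the URL stops.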