I have the following code, which uses C# Regex to find every "http://..... " in my input, but it doesn't find anything. What am I missing?

 Match m = Regex.Match(input, "http* ");
 while (m.Success)
 {
   Console.WriteLine("'{0}' found at index {1}.",
     m.Value, m.Index);
   m = m.NextMatch();
 }

This is my input text (wrapped for readability):

I recently moved and have a buI recently moved and have a bunch of stuff for sale.
Most prices are based on my research from CL and ebay. Let me know or make an offer
if you like something from the list. Thanks. IKEA RAMBERG bed frame and Sultan
mattress - $150 http://seattle.craigslist.org/est/fuo/4688883554.html Sanus Platinum
Foundations TV Stand - $75 http://seattle.craigslist.org/est/fuo/4687613962.html
Staples Mission Coffee table and 2 sets of nesting/side tables - $90
http://seattle.craigslist.org/est/fuo/4687499215.html
Like new Hoover SteamVac Carpet Cleaner with Clean Surge, F5914900 - $100
http://seattle.craigslist.org/est/hsh/4687474666.html Hauppauge WinTV-HVR-1600
ATSC/NTSC/QAM Tuner Video Card + Remote - $35
http://seattle.craigslist.org/est/sop/4687372003.html Computer with core 2 quad, 2GB
RAM, nforce MB, 1.5TB HDD and more - $200 http://seattle.craigslist.org/est/sys/4687362266.html
LINKSYS CM100 Cable Modem (works with Comcast) - $15
http://seattle.craigslist.org/est/ele/4687639722.html Various computer parts for sale - $1 I
recently moved and have a buI recently moved and have a bunch of stuff for sale. Most prices
are based on my research from CL and ebay. Let me know or make an offer if you like something
from the list. Thanks. IKEA RAMBERG bed frame and Sultan mattress - $150
http://seattle.craigslist.org/est/fuo/4688883554.html Sanus Platinum Foundations TV Stand - $75
http://seattle.craigslist.org/est/fuo/4687613962.html Staples Mission Coffee table and 2 sets of
nesting/side tables - $90 http://seattle.craigslist.org/est/fuo/4687499215.html Like new Hoover
SteamVac Carpet Cleaner with Clean Surge, F5914900 - $100
http://seattle.craigslist.org/est/hsh/4687474666.html Hauppauge WinTV-HVR-1600 ATSC/NTSC/QAM
Tuner Video Card + Remote - $35 http://seattle.craigslist.org/est/sop/4687372003.html Computer
with core 2 quad, 2GB RAM, nforce MB, 1.5TB HDD and more - $200
http://seattle.craigslist.org/est/sys/4687362266.html LINKSYS CM100 Cable Modem (works with
Comcast) - $15 http://seattle.craigslist.org/est/ele/4687639722.html Various computer parts for
sale - $1 "
user2864740
n179911

4 Answers

The problem is that you put an asterisk * after p in your expression "http* ", so your possible matches look like this:

htt
http
httpp
httppp
httpppp

and so on, each followed by a space. Since the input never has a space directly after htt or http (a : follows instead), your expression does not get any matches.

This expression should match:

Match m = Regex.Match(input, "http\\S* ");

(\S means "any non-whitespace character").
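To make the difference concrete, here is a minimal sketch that counts the matches of both patterns. The class name `QuantifierDemo` and the shortened sample string are assumptions made up for illustration, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Shortened, hypothetical sample modeled on the question's input.
    public const string Sample =
        "mattress - $150 http://seattle.craigslist.org/est/fuo/4688883554.html Sanus Platinum";

    // Counts how many times the pattern matches the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // "http* " needs a space right after "htt" plus zero or more "p".
        // That never occurs here because ":" follows "http".
        Console.WriteLine(CountMatches(Sample, "http* "));    // 0

        // "http\S* " allows any run of non-whitespace between "http" and the space.
        Console.WriteLine(CountMatches(Sample, @"http\S* ")); // 1
    }
}
```

Note that the trailing space in the pattern means a URL at the very end of the input, with nothing after it, would not be matched.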

Sergey Kalinichenko
  • @"http\S* " might be more obvious (the @ sign to say that you are not string escaping anything inside the regex) – Lukos Sep 29 '14 at 17:23

For starters, check this previous question on Stack Overflow: "What is the best regular expression to check if a string is a valid URL?"

It appears you misunderstood what * means in regex terms.

"http* "

Means htt followed by 0 or more p followed by a space.

* is not a wildcard fileglob as in DOS or UNIX shell.

* in regex means zero or more of the token it follows (in this case that is p)

For the purpose of your input, you could write:

https?://(\S*)

\S matches any non-whitespace character, and ? makes the s optional so you can grab https as well.
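As a rough sketch of how that pattern could be used to collect every URL, the following uses `Regex.Matches`. The `UrlFinder` class, the `FindUrls` method, and the sample text are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlFinder
{
    // Returns every substring that starts with http:// or https://
    // and runs up to the next whitespace character (or the end of the input).
    public static List<string> FindUrls(string input)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(input, @"https?://(\S*)"))
        {
            // m.Value is the whole URL; m.Groups[1].Value is the part after "://".
            urls.Add(m.Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Hypothetical sample text, not the full input from the question.
        string text = "TV Stand - $75 http://example.com/a.html Modem - $15 https://example.com/b.html";
        foreach (string url in FindUrls(text))
        {
            Console.WriteLine(url);
        }
    }
}
```

Because the pattern does not require a trailing space, it also picks up a URL that sits at the very end of the text.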

But for arbitrary input, a space is not always the only thing that follows a URL. It could be enclosed in a quoted string, for example, in HTML or JavaScript. The following should allow a URL followed by a space or a quote.

https?://([^ "']*)

Using the ^ at the start of the [] makes it a negated (exclusive) character class, which matches anything but the listed characters, and that is often the easiest way to write a pattern. The alternative is to write a fully inclusive pattern, which means you have to list every legal character you expect to handle.
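Here is a small sketch of the negated-class pattern applied to a quoted URL. The `QuotedUrlDemo` class and the HTML fragment are invented for illustration; inside a C# verbatim string the double quote is written as `""`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedUrlDemo
{
    // Returns the first URL in the input, stopping at a space or quote character,
    // or null when nothing matches.
    public static string FirstUrl(string input)
    {
        Match m = Regex.Match(input, @"https?://([^ ""']*)");
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Hypothetical HTML fragment.
        string html = "<a href=\"http://example.com/page.html\">link</a>";
        Console.WriteLine(FirstUrl(html)); // http://example.com/page.html
    }
}
```

With `\S*` in place of the negated class, the match on this fragment would continue through the closing quote and the rest of the tag, since none of it is whitespace.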

I can't remember an actual regex for a compliant URL; it is non-trivial, but you can find a few on Google or Stack Overflow. Just for the general idea, I might write something like the following as an inclusive pattern (extend the character class with anything else you expect, such as % or =):

https?://([-+a-zA-Z0-9._/&?]*)

As noted in a comment by Lukos, keep the C# string escaping in mind. I usually use verbatim strings in C# for regexes.

var pattern = @"https?://\S*";
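To illustrate the escaping point, this short sketch shows that a regular string literal with a doubled backslash and a verbatim literal produce the same pattern text. The class and constant names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    public const string Escaped = "https?://\\S*";   // regular literal: backslash doubled
    public const string Verbatim = @"https?://\S*";  // verbatim literal: backslash written as-is

    public static void Main()
    {
        // Both literals hold exactly the same characters.
        Console.WriteLine(Escaped == Verbatim); // True

        // Either one can be passed to the Regex methods.
        Console.WriteLine(Regex.IsMatch("see http://example.com", Verbatim)); // True
    }
}
```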

codenheim
  • Actually he wants the space to know when the URL ends. The space is correct. – Jon Grant Sep 29 '14 at 17:15
  • @JonGrant - Sure, maybe for the purpose of his input sample, but in general a space is not the only delimiter that indicates the end of a URL. In arbitrary HTML a URL can be delimited by other things. I will, however, revise my answer to take into account his input sample. – codenheim Sep 29 '14 at 17:22

Your source code is looking to match this pattern

"http* "

which says to look for the sequence htt, followed by zero or more occurrences of the character p, followed by a literal space (' ') character. You might try matching "http:[^\s]*" instead, which matches the literal text http:, followed by zero or more non-whitespace characters.

Nicholas Carey

There is an important question to answer before choosing the regex to use. Do you want to find anything that looks like a URL (perhaps starting with http or https), or do you want to match only valid URLs? A regex for valid URLs is very complicated. A basic one is easier, but you risk collecting matches on non-URLs in the text, or perhaps on invalid ones made to look like real ones!
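One possible middle ground, offered as an assumption rather than something from the question, is to find candidates with a loose regex and then let .NET's `Uri.TryCreate` reject strings it cannot parse as absolute URIs. The `UrlFilter` class and `FindWellFormedUrls` method below are hypothetical names. This only checks syntax as .NET understands it, which is fairly lenient, and says nothing about whether the page exists.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlFilter
{
    // Finds URL-looking candidates with a loose regex, then keeps only those
    // that Uri.TryCreate accepts as absolute http or https URIs.
    public static List<string> FindWellFormedUrls(string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, @"https?://\S+"))
        {
            Uri uri;
            if (Uri.TryCreate(m.Value, UriKind.Absolute, out uri)
                && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
            {
                result.Add(m.Value);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample text.
        string text = "Coffee table - $90 http://example.com/item.html and more";
        foreach (string url in FindWellFormedUrls(text))
        {
            Console.WriteLine(url);
        }
    }
}
```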

Lukos
  • For a simple one, I would suggest @" (http|https)://\S* " since that is more likely to match real URLs than text which might contain http. – Lukos Sep 29 '14 at 17:27