
Could someone help us with a regular expression to detect repeated patterns inside a URL string? The goal, obviously, is to detect malformed or weird URLs.

For example, the following URLs are alright:

http://www.somewhere.com/help/content/21/23/en/
http://www.somewhere.com/help/content/21/24/en/
http://www.somewhere.com/help/content/21/64/en/
http://www.somewhere.com/help/content/21/65/en/
http://www.somewhere.com/help/content/21/67/en/

While these ones are incorrect and should be tagged:

http://www.somewhere.com/help/content/21/content/1/54/en/
http://www.somewhere.com/help/content/21/content/1/62/en/
http://www.somewhere.com/help/content/21/content/8/52/en/

These are wrong because "content" appears twice in the path. So far we have been solving this using parse_url and explode, but it looks quite inefficient!
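
For reference, what we have is roughly along these lines (a simplified sketch, not our actual code; the function name is only for illustration). It flags any path segment that occurs more than once:

<?php
// Sketch: flag a URL when any path segment appears more than once.
function has_repeated_segment($url) {
    $path = parse_url($url, PHP_URL_PATH);
    $segments = array_filter(explode('/', (string) $path), 'strlen'); // drop empty segments
    if (!$segments) {
        return false;
    }
    $counts = array_count_values($segments);
    return max($counts) > 1;
}

var_dump(has_repeated_segment('http://www.somewhere.com/help/content/21/23/en/'));           // bool(false)
var_dump(has_repeated_segment('http://www.somewhere.com/help/content/21/content/1/54/en/')); // bool(true)
?>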

Also, I'm aware that there might be many URLs that repeat a number or some other value in the path, so any suggestions for solving this issue would be more than welcome.

Thanks a lot!

For a better understanding of the issue, you can visit the following link and click on "Administrador MySQL":

http://www.elserver.com/ayuda/content/21/65/es/

Chris Russo
  • In your examples, there is no "pattern" that repeats - just the word "content". If that's all you're worried about, you can use `substr_count` and flag the URL if "content" appears more than once. Otherwise, please post examples of possible patterns that could repeat. – newfurniturey Sep 24 '12 at 11:53
  • Doesn't seem like these are legacy URLs you need to remap to a new system - it looks like they are being generated incorrectly within the system. Therefore, I would find and fix the actual problem, not band-aid it by parsing and then correcting the already incorrect URLs. – prodigitalson Sep 24 '12 at 11:55
  • newfurniturey, you are right; however, I certainly don't know what else could come up. Since this is part of a crawler, the flow of information is actually huge, and there are many websites with development errors that can make the system loop indefinitely. – Chris Russo Sep 24 '12 at 11:57
  • prodigitalson, I thought the same until I realized it was also happening while browsing the websites with Chrome and Firefox. It's a bug in some wiki implementations. – Chris Russo Sep 24 '12 at 11:59
  • You cannot even begin to try to debug other people's websites. Some websites legitimately have repeated patterns in their URLs. Some may have superfluous repetition semantically, but it is still required to make the site work. If you are making a crawler, you penalise people for supplying bad URLs; you don't try and fix their problems, especially since the range of possible problems when you look at a black box is infinite. – DaveRandom Sep 24 '12 at 12:07
  • Debug other websites? By the "quality" of the question, I would assume firsthand that the OP has a crawler with flaws. And therefore this question - in the way it's written - is just too localized. We won't debug your crawler unless you ask about it. – hakre Sep 24 '12 at 12:12
  • DaveRandom and hakre, thanks for the replies; I assume I didn't ask the question the right way. I just updated the question with a link so you can see what we need to prevent. And, unfortunately, it's not only one site. – Chris Russo Sep 24 '12 at 13:12

2 Answers


Assuming you have a file (testdata.txt) which contains a list of URLs, one per line, the following tested script will extract those URLs having (at least) one repeated path segment:

<?php // test.php Rev:20120924_0800
$re = '%
    ^                  # Anchor to start of line.
    (?:[^:/?#\s]+:)?   # URI scheme (optional).
    (?://[^/?#\s]*)?   # URI Authority (optional).
    (?:/[^/?#\s]*)*?   # URI path segments (before repeats).
    /([^/?#\s]+)       # $1: Repeated URI path segment.
    (?:/[^/?#\s]*)*?   # URI path segments (between repeats)
    /\1                # $1: Repeated URI path segment.
    (?:/[^/?#\s]*)*    # URI path segments (after repeats).
    (?:\?[^#\s]*)?     # URI query (optional).
    (?:\#\S*)?         # URI fragment (optional).
    $                  # Anchor to end of line.
    %mx';
$text = file_get_contents('testdata.txt');
if (preg_match_all($re, $text, $matches)) print_r($matches[0]);
else echo("no matches!");
?>
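
With the URLs from the question in testdata.txt, this should print only the three .../content/21/content/... entries. If you need to test a single URL at a time (e.g. inside a crawler loop), the same idea works with preg_match; a quick sketch reusing the $re pattern defined above:

<?php
// Single-URL check (sketch): assumes $re from the script above is in scope.
$url = 'http://www.somewhere.com/help/content/21/content/1/54/en/';
echo preg_match($re, $url) ? "repeated path segment\n" : "ok\n";
?>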
ridgerunner

Just some pointers to get you in the right direction:

  • The URIs are not malformed. They are syntactically correct and therefore well-formed.
  • To solve your issue, do not generate these URIs in the first place.
  • If you create a scraper, you need to adhere to the standards, including how to resolve a relative URI against the document's base URI: https://www.rfc-editor.org/rfc/rfc3986#section-4.2

But unless you post some code, there is not much more we can say.


The example data-set shows that there is a problem with the data:

Base URI: http://www.elserver.com/ayuda/content/21/65/es/
HREF    : content/1/62/es/%BFc%F3mo-ingreso-al-phpmyadmin.html
          (ISO/IEC 8859-1    %BF = ¿    %F3 = ó)

This is correctly resolved to the following absolute URI:

http://www.elserver.com/ayuda/content/21/65/es/content/1/62/es/%BFc%F3mo-ingreso-al-phpmyadmin.html

This produces the duplicate content. Obviously this is an error on the website, which can easily be verified by testing:

http://www.elserver.com/ayuda/content/1/62/es/%BFc%F3mo-ingreso-al-phpmyadmin.html
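
For reference, that resolution step in PHP looks roughly like this (a minimal sketch with a made-up helper; it only covers this simple relative-path case, not full RFC 3986 reference resolution):

<?php
// Sketch: resolve a relative href against a base URI whose path ends in "/"
// by appending it to the base path (not a general-purpose resolver).
function resolve_simple($base, $href) {
    $p = parse_url($base);
    $dir = substr($p['path'], 0, strrpos($p['path'], '/') + 1); // up to last "/"
    return $p['scheme'] . '://' . $p['host'] . $dir . $href;
}

echo resolve_simple(
    'http://www.elserver.com/ayuda/content/21/65/es/',
    'content/1/62/es/%BFc%F3mo-ingreso-al-phpmyadmin.html'
);
// http://www.elserver.com/ayuda/content/21/65/es/content/1/62/es/%BFc%F3mo-ingreso-al-phpmyadmin.html
?>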

Because you cannot tell just by looking at the two URIs that they point to the same content, you need to develop a strategy (or multiple ones) for how to deal with the problem.

You could for example ...

  • ... compare the contents for duplicates on your own, e.g. create an MD5 and SHA-1 checksum of each page and keep a list. If both checksums are the same, it is highly likely that the content is the same, too (see the sketch after this list).
  • ... decide that URIs which grow too long are broken.
  • ... establish machine learning to learn which URL patterns create duplicate content.
  • ... create "good enough to try" URIs when there is some overlap between the base URI and the given relative URI, to detect these kinds of problems, and test whether those URIs work.
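
For the first strategy, a minimal checksum-based sketch could look like this (illustrative only; in a real crawler the list would live in a database, not an in-memory array):

<?php
// Sketch: detect duplicate content via MD5 + SHA-1 checksums of the page body.
$seen = array(); // checksum => first URL the content was seen at

function is_duplicate($url, $body, &$seen) {
    $key = md5($body) . '|' . sha1($body);
    if (isset($seen[$key])) {
        return $seen[$key]; // URL of the identical content crawled earlier
    }
    $seen[$key] = $url;
    return false;
}

// Usage inside the crawl loop:
// $body = file_get_contents($url);
// if ($dup = is_duplicate($url, $body, $seen)) { /* skip: same content as $dup */ }
?>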

Obviously, the different strategies require more or less work and also influence the data structures and databases you will have with your crawler.

As you can see, this is not trivial. Some websites even offer endless URL tarpits to make a crawler give up, so you need something more robust in place here anyway.

hakre
  • Thanks a lot hakre, that's a good approach, and I agree, the URLs are not malformed at all. I should edit the question. Regarding points 2 and 3, the problem doesn't come from our code but from some web apps. If you would like to see what I mean, you can take a look here and click on "Administrador MySQL": http://www.elserver.com/ayuda/content/21/65/es/ – Chris Russo Sep 24 '12 at 13:06
  • Just updated the question, it should be easier to understand the issue now :) – Chris Russo Sep 24 '12 at 13:10
  • @ChrisRusso: Updated the answer; the data you've given explains it much better. But there is no pre-made library that will solve it for you, at least none I'm aware of. I'd say you need to get a bit creative about how you want to deal with it. – hakre Sep 24 '12 at 13:29
  • Thanks a lot for helping us, I totally agree with you now. We are currently trying to create a few algorithms that would allow us to detect when the system is looping, and we are already comparing the sources using something very similar to the MD5 process you previously described. However, I believe it's possible to predict something like this when the results of parse_url start repeating information in the path component. I guess we're going to approach the problem that way, and after that compare the results against similar addresses. Thanks again. – Chris Russo Sep 24 '12 at 14:17
  • Yes, that prediction goes a bit towards machine learning. You would create a world in which the algorithm can make decisions and, based on the duplicate-content checks, can identify when the decision was right or wrong. You need to decide whether it's worth programming such a system, though. – hakre Sep 24 '12 at 14:58