3

I have 2 tables (url_feed and clean_domains). I'm trying to copy all of the data from url_feed into clean_domains and at the same time insert the domain from the url column.

Also, it will change the status from queued to complete in url_feed after it copies the row to clean_domains.

Here's what the 2 tables look like:

url_feed

id | url                                    | matches | networks                    | status
1  | http://whatever.com/example1.php       | 5       | Facebook::Twitter Inc       | queued
2  | http://www.example.com/other-stuff.php | 2       | MySpace::Facebook::Twitter  | queued
3  | http://www.test.com/random-text        | 12      | Instagram::Twitter          | queued

clean_domains

id | domain       | url                                     | matches | networks                    | status
1  | whatever.com | http://whatever.com/example1.php        | 5       | Facebook::Twitter Inc       | queued
2  | example.com  | http://www.example.com/other-stuff.php  | 2       | MySpace::Facebook::Twitter  | queued
3  | test.com     | http://www.test.com/random-text         | 12      | Instagram::Twitter          | queued

Here's my code:

<?php
$con=mysqli_connect("localhost","redacted","redacted","redacted");

mysqli_query($con,"INSERT INTO clean_domains
(id,domain,url,matches,networks)
  SELECT
    id,
    SUBSTRING_INDEX(REPLACE(REPLACE(REPLACE(REPLACE(url, 'http://', ''), 'https://', ''), 'http://www.', ''), 'https://www.', ''), '/', 1),
    url,
    matches,
    networks
  FROM url_feed
  WHERE status = 'queued'");

mysqli_query($con,"UPDATE url_feed
SET    status = 'complete'
WHERE status = 'queued' AND
id IN (SELECT id
FROM   clean_domains)");

mysqli_close($con);
?>

My code is working for 99% of domains, but I can't quite figure out how to make it work perfectly.

Here's the 3 times it seems to not work perfectly:

  1. Colons - http://example.com:88/testing - URLs like this are having their domains output as example.com:88 whereas I'd want it to be example.com.

  2. IP Addresses - http://188.123.44.12/test.php - For IPs it seems to be correctly entering the IP address into the database. In the example it would enter 188.123.44.12 as the domain - but I don't want that. I only want domain names, so if it's an IP, it shouldn't be copied over. It should be marked as complete in url_feed and move on to the next row.

  3. Subdomains - http://subdomain.whatever.example.com/test.html - This is being entered into the domain column as subdomain.whatever.example.com when I want it to be example.com instead.

The only way I can think of verifying if a domain being entered is truly a domain would be to run a whois query on each of them. If it doesn't come back as a valid one, it removed the first block of text. For example, it wouldn't get a valid result for subdomain.whatever.example.com so then it tries whatever.example.com and then it tries example.com until the result is valid or it skips it and marks the status column as complete.

Any ideas on what I can change to make this work properly?

Here's where I am right now with this:

$_url_string = 'https://testfewfew.dsd.google.co.uk/testing/whatever';
preg_match("/[a-z0-9\-]{1,63}\.[a-z\.]{2,6}$/", parse_url($_url_string, PHP_URL_HOST), $_domain_tld);
echo $_domain_tld[0];
user10848
  • 161
  • 1
  • 8
  • i would suggest use regular expression to extract the domain, if not ip insert the record & update url_feed – Abdul Rehman Jan 01 '16 at 21:08
  • Possible duplicate of [Parsing Domain From URL In PHP](http://stackoverflow.com/questions/276516/parsing-domain-from-url-in-php) – Shadow Jan 01 '16 at 21:16
  • @Shadow my script already does everything listed there. – user10848 Jan 01 '16 at 21:17
  • the third situation seems to be somehow complicated, as there may be domains like `aaaa.bbbb.example.co.us`. a simple regex won't work this out as it doesn't know whether to take `example.co.us` or `co.us`. you may need the http://www.iana.org/domains/root/db list in your code. – Frederick Zhang Jan 01 '16 at 21:18
  • I would not do this in mysql, but in php, hence I marked it as duplicate with the topic linked sbove. Pls read all answers, not just the top ones! The only thing that is not described there is how to get the domain name from IP. Use gethostbyaddr() function for that. – Shadow Jan 01 '16 at 21:20
  • @FrederickZhang I was thinking it may be best to just remove the first section, then the section section, and so on until it found the valid domain. That way it'd work with any domain and a list wouldn't have to be updated every time a new TLD was added. – user10848 Jan 01 '16 at 21:20
  • @user10848 no, your code only does a fraction of what is discussed in the linked topic. You really need to read the contents, not just the title! – Shadow Jan 01 '16 at 21:21
  • @Shadow take a look at the code I just posted under "Here's where I am right now with this". Does that look like it's working to you? It seems to work with 100% of domains to me. Almost seems too easy though lol – user10848 Jan 01 '16 at 22:00

1 Answers1

5

Just use built-in php function parse_url

You can filter a subdomain from a hostname like this

$url = 'http://subdomain.whatever.example.com/test.html';

$data = parse_url($url);

$host = $data['host'];

$hostname = explode(".", $host);
$domain = $hostname[count($hostname)-2] . "." . $hostname[count($hostname)-1];

print $domain;

Will output

example.com

If you have an url with a port, parse_url will deal with it easily, example

$url = 'http://example.com:88/testing';

$data = parse_url($url);

print_r($data);

Will output

Array
(
    [scheme] => http
    [host] => example.com
    [port] => 88
    [path] => /testing
)

And below you check if the hostname is a valid IP address or not

$url = 'http://188.123.44.12/test.php';

$data = parse_url($url);

print_r($data);

$hostIsIpAddress = ip2long($data['host']) !== false;

var_dump($hostIsIpAddress);

Which will output bool(true) or bool(false) respectively

Alex Andrei
  • 7,315
  • 3
  • 28
  • 42
  • Almost works, I thought of doing similar with `parse_url` but I think we've run into the same issue. It works with most domains, but not for ones like `.co.uk`. For example, `http://test.co.uk` become `co.uk` – user10848 Jan 01 '16 at 21:35
  • You've gotten closer than me with PHP, I just can't think of any way to verify if a domain is actually legit without using whois or a constantly updating list of subtlds. – user10848 Jan 01 '16 at 21:38
  • see this answer that deals better with multi-level tlds http://stackoverflow.com/a/21809647/5043552 – Alex Andrei Jan 01 '16 at 21:43
  • Can you take a look at the code I just posted under "Here's where I am right now with this". Does that look like it's working to you? It seems to work with 100% of domains to me. Almost seems too easy though lol – user10848 Jan 01 '16 at 22:01