1

I have a list of a million or urls in an mysql table.

I need to cleanse the data (extract domains) so I can be confident about DISTINCT type queries.

Data is in several different types: -

www.domain.tld
domain.tld
http://domain.tld
https://vhost.domain.tld
domain.tld/

There are invalid domains and empty data.

Ideally I'd like to do something along the lines of : -

UPDATE table1 SET domain = website REGEXP '^(https?://)?[a-zA-Z0-9\\\\.\\\\-]+(/|$|\\\\?)'

domain being a new empty field, website being the original url.

martin blank
  • 2,184
  • 3
  • 17
  • 15
  • There is no fixed answer to what 'the same' domain is, it depends on your application domain, use case, morals and personal convictions, and how many vanity TLDs ICANN has created lately. That aside, your regex isn't bad, but the MySQL `REGEXP` only seems to return a boolean. I'd write a little external script to scrub the data and load it in the new column; an alternative is to run a few updates of the type `SET domain = SUBSTR(domain from 8) where LEFT(domain, 7) = 'http://'` until all the cruft is gone. – LHMathies Jul 07 '11 at 13:18

1 Answers1

0

You can't use regex like that in MySQL as is, but apparently you can some some UDFs that implement it. See:

Community
  • 1
  • 1
Qtax
  • 33,241
  • 9
  • 83
  • 121