0

I've been trying to extract any URL present within my 'Text' column in Big Query. The column contains a mixture of text and URLs dotted throughout (a cell might contain more than one URL) I'm trying to use this regexp:

SELECT

  REGEXP_EXTRACT  (Text, r'(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*')
FROM
Data.Text_Files

I currently get 'failed to parse regular expression' when I try to run the query. I've tried modifying it but to no avail.

The regexp works in an online builder but I'm just not sure how to incorporate it into Big Query.

Any help would be much appreciated - or at least pointers on how to incorporate regular expressions into Big Query!

halfer
  • 19,824
  • 17
  • 99
  • 186
Ed Moon Little
  • 77
  • 2
  • 12

2 Answers2

3

Try below - it is for BigQuery Standard SQL (see Enabling Standard SQL and Migrating from legacy SQL)

WITH YourTable AS (
  SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask).  ' AS Text UNION ALL
  SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL
  SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text 
)
SELECT 
 id, 
 REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL
FROM YourTable

This gives you output with id field, and repeated field with all respective URLs

If you need flattened result - you can use below variation

WITH YourTable AS (
  SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask).  ' AS Text UNION ALL
  SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL
  SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text 
)
SELECT
  id, URL    
FROM (
  SELECT id, REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL
  FROM YourTable
), UNNEST(URL) as URL

Note: you can use here any regexp that you will be able to find on web - but what a must is - there is only one matching group is allowed! so all inner matching group should be escaped with ?: as you can see it in above examples. So the ONLY group that you expect to see in output should be left as is - w/o ?:

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Hi Mikhail - Many thanks for the input! I've been playing around with this and while your query works well, I'm still trying to understand how to modify it so it queries an existing column in my table. Is the WITH YourTable AS ( SELECT 1 AS id... part only there so as to illustrate your example? I.e with my own query do I just start from SELECT id, REGEXP_EXTRACT...? – Ed Moon Little Oct 23 '16 at 16:16
  • That's correct. Remove WITH part and start with SELECT I'd, REG... Make sure you are using field names that you actually have in your table – Mikhail Berlyant Oct 23 '16 at 16:36
  • It works! This is excellent - have been trying to work this out for ages! Thanks so much for the help – Ed Moon Little Oct 23 '16 at 17:15
  • Glad we finally made it :o) – Mikhail Berlyant Oct 23 '16 at 18:05
  • This example will not give the complete url if it includes hyphens. – Brad Feb 04 '23 at 12:50
2

Your regex has an incomplete capturing group, and has 2 unescaped characters. I don't know which online regex builder you're using, but maybe you forgot to put your new regex into it?

The problems are as follows:

(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*
POINTERS TO PROBLEMS ON THIS LINE --->                             ^1                    ^^2
  1. This is the start of a capturing group with no end. You probably want the ) right before the *.
  2. All slashes need to be escaped. This should probably be \/ or maybe even \/\\.

Here is an example with both of my suggestions implemented: https://regex101.com/r/pt1hqS/1

Good luck fixing it!

Addison
  • 7,322
  • 2
  • 39
  • 55
  • Thanks for the pointers Addison, however I'm still getting "failed to parse expression" when I put it through Big Query? I can see that it works in the regexp builder, but I think I'm missing something when implementing it in this query: `SELECT REGEXP_EXTRACT (Text, r'http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&\/\/=])*') FROM Data.Content` – Ed Moon Little Oct 22 '16 at 08:42
  • You're missing the bracket at the start of the string (it was there before). Make sure you copy/paste code carefully, since most browsers don't make it easy. – Addison Oct 22 '16 at 08:55
  • Or, if you want to use some regex I just modified from [another answer](http://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string), try: `https?:\/\/[\w_-]+(?:\.[\w_-]+)+([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?` – Addison Oct 22 '16 at 09:02
  • My bad - I've reinserted the starting bracket but I'm still getting the same issue.. the query now looks like this: `SELECT REGEXP_EXTRACT Text, r'(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&\/=])*)' FROM Data.Content` – Ed Moon Little Oct 22 '16 at 09:10
  • Could it be something to do with the way Text, r'... is formatted? – Ed Moon Little Oct 22 '16 at 09:11
  • @EdMoonLittle - Every time you past me your regex, it's slightly different. This time it is failing because there is an extra bracket at the end of it, after the `*`. Please copy carefully - this has happened already. – Addison Oct 22 '16 at 10:54