6

I have to extract column names from a MYSQL SELECT and I'd wish to do that using Regex.
It's a plain SELECT, something like:
SELECT column1, column2 ... FROM table

I have to cover every cases, with our without alias, with or without table in front of it, with or without the quoting char:

SELECT column, column as foo, table.column, table.column as foo, 
       `column`, `column` as foo, `table`.`column`, `table`.`column` as foo
       .....

Currently I've been able to work out this regex: #\w+(\sas)?#i but it's not good vs prefixed columns.
Any help?

By the way, is Regex good at this task?

EDIT
Thanks for the answers!
The patterns you posted are valid for the whole query, actually I'm already processing every single column:

$fields = Frameworkmethod::getSelectFields($query);
$columns = explode(',' , $fields);
foreach($columns as $column)
{
     //do Regex work to "clean up" the single field and get the "standard" one (not the alias)
     //`#__tracktime_projects`.`pr_name` AS `project_name` should return pr_name
}

As stated in the comment above, I always need the field name, not the alias one. Sorry for not pointing it out before!

tampe125
  • 8,231
  • 7
  • 30
  • 45
  • 1
    `is Regex good at this task` that totally depends on the language you are using to program!Specify the language/tool you are using so that we can help you – Anirudha Apr 14 '13 at 18:27

6 Answers6

7

I made use of Collapse and Capture a Repeating Pattern in a Single Regex Expression and adapted it to fit this purpose.

So, a hopefully bulletproof RegEx for capturing column names from a *SQL query :

/(?:SELECT\s++(?=(?:[#\w,`.]++\s++)+)|(?!^)\G\s*+,\s*+(?:`?+\s*+[#\w]++\s*+`?+\s*+\.\s*+)?+`?+\s*+)(\w++)`?+(?:\s++as\s++[^,\s]++)?+/ig

Explained Online demo: http://regex101.com/r/wL7yA9

PHP code using preg_match_all() with single RegEx, commented with /x modifier:

preg_match_all('/(?:SELECT\s++(?=(?:[\#\w,`.]++\s++)+) # start matching on SELECT
                |              # or
                (?!^)\G        # resume from last match position 
                \s*+,\s*+      # delimited by a comma 
                (?:`?+\s*+     # optional prefix table with optional backtick
                    [\#\w]++   # table name
                    \s*+`?+    # optional backtick
                    \s*+\.\s*+ # dot separator
                )?+ # optional prefix table end group

                `?+\s*+ # optional backtick

            ) # initial match or subsequent match

            (\w++)    # capturing group
            `?+         # optional backtick


            (?:\s++as\s++[^,\s]++)?+ # optional alias

            /ix', $query, $matches);

Live code: http://codepad.viper-7.com/VTaPd3

Note: the 'hopefully bulletproof' is aimed at valid SQL


PHP code using explode()

$columns = explode(',', $fields);

foreach($columns as $column)
{
    $regex='/([\w]++)`?+(?:\s++as\s++[^,\s]++)?+\s*+(?:FROM\s*+|$)/i';

    preg_match($regex, $column, $match);

    print $match[1]; // field stored in $match[1]
}

Live code with example extraction: http://codepad.viper-7.com/OdUGXd

Community
  • 1
  • 1
CSᵠ
  • 10,049
  • 9
  • 41
  • 64
2

I used PHP:

$query = 'SELECT column1, column2 as foo, table.column3, table.column4 as foo, 
       `column5`, `column6` as foo, `table`.`column7`, `table`.`column8` as foo
       FROM table';

$query = preg_replace('/^SELECT(.*?)FROM.*$/s', '$1', $query); // To remove the "SELECT" and "FROM table..." parts

preg_match_all('/(?:
    (?:`?\w+`?\.)? (?:`)?(\w+)(?:`)? (?:\s*as\s*\w+)?\s*
#   ^--TableName-^ ^---ColumnName--^ ^----AsFoo-----^
)+/x',$query, $m);

print_r($m[1]);

Output:

Array
(
    [0] => column1
    [1] => column2
    [2] => column3
    [3] => column4
    [4] => column5
    [5] => column6
    [6] => column7
    [7] => column8
)

Live demo: http://www.rubular.com/r/H960NFKCTr


UPDATE: Since you're using some "unusual" but valid SQL table names (e.g.: #__tracktime_projects) it has messed up the regex. So to fix this issue, I added a variable which contains what characters we would expect, I also added the i modifier to make the match caseless:

$query = 'SELECT column1, column2 as foo, table.column3, table.column4 as foo, 
       `column5`, `column6` as foo, `table`.`column7`, `table`.`column8` as foo, `#__tracktime_projects`.`pr_name` AS project_name, `#wut`
       FROM table';


$query = preg_replace('/^SELECT(.*?)FROM.*$/s', '$1', $query); // To remove the "SELECT" and "FROM table..." parts

$allowed = '\w#'; // Adjust this to the names that you expect.

preg_match_all('/(?:
    (?:`?['.$allowed.']++`?\.)?
#   ^--------TableName--------^

    (?:`)?(['.$allowed.']++)(?:`)?
#   ^----------ColumnName--------^

    (?:\s*as\s*['.$allowed.']++)?\s*
#   ^-------------AsFoo------------^
)+
/xi',$query, $m);

print_r($m[1]);

Output:

Array
(
    [0] => column1
    [1] => column2
    [2] => column3
    [3] => column4
    [4] => column5
    [5] => column6
    [6] => column7
    [7] => column8
    [8] => pr_name
    [9] => #wut
)

Live demo: http://www.rubular.com/r/D0iIHJQwB8

HamZa
  • 14,671
  • 11
  • 54
  • 75
  • that's very nice, however I need the field name and not the alias one (sorry I just updated the question) – tampe125 Apr 14 '13 at 20:41
  • testing vs this string extracts the alias and not the column name: http://www.rubular.com/r/SJLbVC0GGR – tampe125 Apr 14 '13 at 21:13
0

If using PHP (i'm sure other languages support this too) you could consider getcolumnmeta. From the docs:

<?php
  $select = $DB->query('SELECT COUNT(*) FROM fruit');
  $meta = $select->getColumnMeta(0);
  var_dump($meta);
?>

Result:

array(6) {
  ["native_type"]=>
   string(7) "integer"
  ["flags"]=>
   array(0) {
  }
  ["name"]=>
   string(8) "COUNT(*)"
  ["len"]=>
    int(-1)
  ["precision"]=>
    int(0)
  ["pdo_type"]=>
    int(2)
}
Martin
  • 6,632
  • 4
  • 25
  • 28
0

This is generic solution:

((select|,)\s+((((`)?\w+\6?\.)?(`)?\w+\7?)(\s+as\s+(`)?\w+\9?)?\s*))+from\s

Group $3 contains column with possible table and alias decoration.

You can use non-capture groups - (?...) for some parts of expression. In such case change backreference numbers.

Run it with single line and ignore case flags. Maybe you should change identifier specification \w+ to more specific [a-zA-Z]\w* for example.

Petr Behenský
  • 620
  • 1
  • 6
  • 17
  • It seems that the tool reports only the last capture. – Petr Behenský Apr 14 '13 at 20:05
  • In your validator use only '(select|,)\s*(((`)?\w+\4?\.)?(`)?\w+\5?(\s+as\s+(`)?\w+\7?)?)\s*' but with isg options. – Petr Behenský Apr 14 '13 at 21:15
  • Not so great. Look at regex101.com/r/rL1bE2. Do you think that result is correct? It should list all 'a,', 'b,' and 'c,'. It's the reason why it doesn't list all columns in my regex. – Petr Behenský Apr 14 '13 at 21:51
  • let's continue this discussion on the [RegEx chat room](http://chat.stackoverflow.com/rooms/25767/regex) to avoid cluttering the comment space, also please delete the comments here – CSᵠ Apr 14 '13 at 21:55
0

This is an old question but I had the same issue and could no solve using @CSᵠ regex. I created another regex to do the task on a wider range of columns, included calculated fields

preg_match_all('/(?<=^SELECT |, |\) )([a-z]+\.)?([a-z]+ )?(as )?([a-z]+)(?= ?,|$)/im');

If will be used in complex queries I suggest use the complete function I created: https://gist.github.com/pedrosancao/2498ed85b3c1834c5bdd

Pedro Sanção
  • 1,328
  • 1
  • 11
  • 16
0

I'm late to the party, but most of these were too complex for me and PHP-specific, which I guess is what you were after but a lot of people looking at this may not be. The PCRE-flavored regex I went with was:

([\w`]+)(?=\s*,|\s+from\s+)

This matches alphanumeric and backticks ([\w`]+) characters that are immediately followed by a comma (or whitespace and comma) or a space and the keyword 'from'.

The latter is accomplished using a positive lookahead

(?=<expression>)

in this case

(?=\s*,|\s+from\s+)

which checks that the preceding token [\w`]+ is followed by

\s*,       # whitespace (0 or more) and comma

OR (|)

\s+from\s+ # the keyword 'from', surrounded by >=1 whitespace characters

Then you can use the case flag in whatever language flavor to ignore case.

EDIT: Should also note this matches the alias name rather than the original column name.

Sibs
  • 81
  • 1
  • 11