3

I am trying to tokenize my table fields with a query.

SELECT regexp_split_to_table(mytable.field_name, E'\\s+') from mytable limit 20;

This works when I execute it from psql shell but when I do:

from django.db import connection cursor=connection.cursor() 
cursor.execute("SELECT regexp_split_to_table(mytable.field_name,E'\\s+')
                FROM mytable LIMIT 20") 
cursor.fetchall()

... it fails to return tokens. What am I doing wrong?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
mossplix
  • 3,783
  • 2
  • 26
  • 31
  • it seems psycopg doent execute postgres functions. – mossplix Jan 18 '12 at 08:30
  • Don't know much about Django, but here's an idea: "SELECT * FROM (SELECT regexp_split_to_table(fld, E'\\s+') FROM tbl LIMIT 20) x LIMIT 20". In case Django doesn't understand that a function can return a set of values. – Erwin Brandstetter Jan 18 '12 at 08:41
  • that does not work either. say one on fields is "SOPA blackout", it wont return [("SOPA","blackout")] but will return [("SOPA blackout")] – mossplix Jan 18 '12 at 09:30
  • 1
    So the query **does** return something? Try `' +')` instead of `E'\\s+')` to determine if the escape string is the problem. May be that you have to double the \ to arrive at `E'\\\\s+')`. Just hat a similar problem here in the comment. – Erwin Brandstetter Jan 18 '12 at 09:39

2 Answers2

4

The backslash is treated as meta-character by Django and is interpreted inside double quotes. So one layer of E'\\s+') gets stripped before the string arrives at the PostgreSQL server, which will see E'\s+'). The escape string will result in 's+' which in turn will make regexp_split_to_table() split your strings at any number of s instead of non-printing space, which the character class shorthand \s stands for in regular expressions.

Double your backslashes in the string to get what you intended: E'\\\\s+'):

"SELECT regexp_split_to_table(field_name, E'\\\\s+') FROM mytable LIMIT 20"

As an alternative, to avoid problems with the special meaning of the backslash \, you can use [[:space:]] to denote the same character class:

"SELECT regexp_split_to_table(field_name, '[[:space:]]+') FROM mytable LIMIT 20"

Details in the chapter "Pattern Matching" in the manual.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
2

Thanks to new in Django F, Func, and supprot for postgresql ArrayField you can now call this function like this:

from django.db.models import F, Value, TextField
from django.contrib.postgres.fields import ArrayField
from django.db.models.expressions import Func

MyTable.objects.annotate(
  some_field_splitted=Func(
      F('some_field'), 
      Value(","), 
      function='regexp_split_to_array',
      output_field=ArrayField(TextField())
   )
).filter(some_field_splitted__contains=[HERE_SOME_VALUE])
andilabs
  • 22,159
  • 14
  • 114
  • 151