2

I have data in a table that look like this:

src_id, dst_id, params
int   , int   , array

I'm querying the data to extract some values from the array with the following Django query

dataset = query_set.values_list('src_id', 'dst_id', *[e.field for e in settings])

I need to output the data like this:

[
[1,2,[a,b,c,d]],
[3,4,[a,c,d,e]],
...
]

but the values_list returns a list of tuples like (1,2,a,b,c,d)(3,4,a,c,d,e)...

So I've been doing the following to recombine the tuple values into an array:

[[d[0], d[1], d[2:]] for d in dataset]

It works fine, but I have 1M+ rows, and it is slow. It is also using a lot of memory. Ideally I want to format the data properly in the SQL (through Django) and push the output directly to the JSON serializer without having to generate an intermediate array.

Is there a way to optimize this code?

I looked at using a loop, or lambda but that doesn't make much of a difference. I looked at using array but it only takes primitive types, so no much luck with array of arrays.

I am looking for a way to query the DB and output the data directly in the right format, if possible in Django:

I see Django can do slice array lookup like

dataset = query_set.values_list('src_id', 'dst_id', 'data__3_10')

to get the values between indices 3 and 10 for example, but I need to retrieve specific indices (like 1,4,5,6,8,11 for example), not slices.

Any hint on how to format the output of a query like this? is it possible, or just too esoteric for Django? Do I need raw SQL for this?

Thanks

hynekcer
  • 14,942
  • 6
  • 61
  • 99
MrE
  • 19,584
  • 12
  • 87
  • 105
  • The formatting is unclear that strings in the array should be really without quotes and that no string can contain a comma or "]". If I know, inidividual elements of array (params) can be used as a lookup in `.filter(data__3=...)`, but can not be referenced in values_list. How you achieved it? – hynekcer Oct 16 '17 at 01:33
  • The question was first unclear. Now I understand from your answer that `[1,2,[a,b,c,d]],` is not at all a formated output, but a symbolic representation of a nested object, where "a,b,c,d" are values of any identical type. Everything is finally formated by JSON style, that is not a part of any example, because the question is general. (fixed now a typo above) – hynekcer Oct 25 '17 at 17:42

2 Answers2

1

If the database is PostgreSQL or Oracle then memory requirements can be substantially decreased in Django 1.11 by iterator() method applied to any queryset, including values() or values_list() queryset. Then a server-side cursor is used by default without doing any caching in Django or in the database driver. Read details in the docs.

Demo

qs = query_set.values_list('src_id', 'dst_id', 'data')
for row in qs.iterator():
    print(row) 

Output

(3, 4, ['a', 'c', 'd', 'e'])
...

Complete example

qs = query_set.values_list('src_id', 'dst_id', 'data')
with open(..., 'w') as f:
    f.write('[\n')
    for src_id, dst_id, data in qs.iterator():
        str_sub_array = ','.join(data[x] for x in column_settings)
        f.write('[{},{},[{}]],\n'.format(src_id, dst_id, str_sub_array)
    f.write(']\n')

Memory requirements are negligible.

hynekcer
  • 14,942
  • 6
  • 61
  • 99
  • thanks for the info. that may help although i'm not on 1.11 yet. I can see how iterator would reduce memory req, however the question was more about formatting the data as an array out of the queryset, while picking only necessary values out of the original array. I haven't found a way to do that in Django, only through writing a raw SQL query. Once the data is formatted right, all i need is return it, but without it I do need to iterate over each row, which is slow. – MrE Oct 16 '17 at 00:06
  • 1
    @MrE: Updated answer. Formating is trivial if requirements are clear, memory requirements without named cursor supported by Django 1.11 are a hard problem and raw sql probably can't help much. I think that memory and a fast loop is the necessary basis. – hynekcer Oct 16 '17 at 01:15
  • my goal was to eliminate the formatting loop altogether, by having the SQL return the right format. My current loop takes up to 2min, while the query takes 10sec. The iterator might save memory for a loop, but in the end I need to return the whole dataset, so if the SQL returns it all in one shot, I'll save the memory required to build the output. There doesn't seem to be a Django way for this, so I'll go with raw SQL. – MrE Oct 16 '17 at 16:23
  • Thanks for the tip on iterator(). It did help in the end reducing memory footprint, thanks to the json serializer accepting an iterator as input. The main part of the question was about formatting things as an array from SQL in Django though, so I don't feel this should be the accepted answer. – MrE Oct 19 '17 at 14:43
  • I expected by the title "Format Django QuerySet" that you want to improve it not to avoid it. You did not paste any source about formatting that is slow, to see that it can not be easily improved by recommended best practices. You did not write a ratio: number_of_used_array_field_values / length_of_array_field. One half is absolutely negligible while one hundredths is really very problematic. – hynekcer Oct 19 '17 at 15:27
  • that code: `[[d[0], d[1], d[2:]] for d in dataset]` is the formatting of the data from a list of floats to 2 floats and an array. Maybe it wasn't obvious because the only hint here is the `:` in `d[2:]`. As for number used vs length of array, it varies depending on the query. I'll try to make the title more meaningful – MrE Oct 19 '17 at 15:31
1

I was able to format the output directly in Django without resorting to raw SQL, using the extra() feature:

dataset = queryset.objects\
               .extra(select={'data': 'ARRAY[{}]'.format(",".join([fields])})\
               .values_list('src_id','dst_id','data')
return dataset.iterator()

This allows to select the specific indices in the SQL array as I needed.

Returning the iterator (as suggested by @hynekcer), as opposed to the result array itself, actually works better for memory.

Having avoided the loop to reformat the data, the iterator can be sent directly to the JSON serializer which accepts an iterator as an input.

MrE
  • 19,584
  • 12
  • 87
  • 105
  • A normal iterator is not serializable directly by `json` or `simplejson`. There are some similar solutions by `default` or `cls` parameters or by subclassing `JSONEncoder` that convert the iterator to list internally. Better solutions for huge data are those that cross (subclass) a type list with iterator type on order to combine low memory requirements and direct serializability. One solution is my [SerializableGenerator(iterator)](https://stackoverflow.com/a/46841935/448474). If you serialize to a stream by `json.dump` not to a string then also output memory requirements are low. – hynekcer Oct 25 '17 at 18:14
  • A simple "Query expression function" can be written that doesn't require ".extra()", useful if the query uses more tables and a syntax "relation_field__other_field" is can be compiled by the backend without hard coding table names. Your solution is probably the easiest for a simple case. (Excuse that my answer was not complete. Your question was broad initially and I waited for a feedback after the first step.) – hynekcer Oct 25 '17 at 18:36
  • `ujson` can take an iterator as an input and works fine. That's what I use. it's also one of the faster serializers out there. – MrE Oct 27 '17 at 05:47