I believe the most likely explanation is that the submissions returned by the query don't all have the same number of fields, and the way you are constructing the DataFrame cannot handle that. I'll suggest two ways to work around this, then explain in more detail what I think is happening.
Option 1: convert to dicts
You could convert each namedtuple record into a dictionary. This should be safer because then pandas won't assume that every record has the same set of fields in the same order. If some records have an extra field then pandas will create a column for it and fill it with NaN for all the other records.
submission_results_df = pd.DataFrame(result._asdict() for result in submission_results)
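As a quick illustration (with made-up records, not your psaw data) of how this behaves when the fields differ: the record with the extra key simply gets its own column, and the other rows are filled with NaN.

import pandas as pd

# Two dict records with differing fields; the second has an extra 'flair' key.
records = [
    {"id": 1, "title": "first post"},
    {"id": 2, "title": "second post", "flair": "news"},
]
print(pd.DataFrame(records))
# Columns are id, title and flair; 'flair' is NaN for the first record.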
Option 2: use the psaw CLI instead
I note that the psaw library you are using has a command-line interface which can save directly to JSON or CSV. If you are in fact only using pandas to convert the data to CSV, this might avoid your difficulties entirely.
Explanation
I haven't directly reproduced the problem using the data from Redis, but I can explain what appears to be happening here. submission_results contains a list of namedtuples, created in _wrap_thing. (I previously misread the source code and thought these were instances of praw.models.reddit.submission, but that is only the case if you provided a Reddit API object during construction.)
The error message "AssertionError: 16 columns passed, passed data had 17 columns" appears to come from pandas' _validate_or_indexify_columns and indicates that pandas expected 16 columns but received data for 17. I'm not 100% clear which code path it took to get there, but I include below an example that produces the same error using namedtuple.
I think it's not a great idea to pass a list of arbitrary objects into the DataFrame constructor directly. The constructor can interpret data in a number of different formats, including some that don't seem to be clearly documented. When it gets a list of namedtuples, it uses the first namedtuple to determine the field names and then converts each item into a list to extract the values. If that is what's happening, then somewhere in your data at least one of the objects has 17 fields instead of 16. I have no idea whether psaw makes any particular guarantee that all objects will have the same number of fields, or even that the fields will appear in the same order when they are the same.
A related reproduction of the same error message, using plain namedtuples instead:
from collections import namedtuple
from pandas import DataFrame
RGB = namedtuple('RGB', 'red green blue')
RGBA = namedtuple('RGBA', 'red green blue alpha')
# This works:
d_okay = DataFrame([RGB(1,2,3),RGB(4,5,6)])
# This fails:
d_bad = DataFrame([RGB(1,2,3),RGB(4,5,6),RGBA(7,8,9,0)])
Traceback (most recent call last):
File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 497, in _list_to_arrays
content, columns, dtype=dtype, coerce_float=coerce_float
File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 581, in _convert_object_array
f"{len(columns)} columns passed, passed data had "
AssertionError: 3 columns passed, passed data had 4 columns
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "repro.py", line 11, in <module>
d_bad = DataFrame([RGB(1,2,3),RGB(4,5,6),RGBA(7,8,9,0)])
File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 474, in __init__
arrays, columns = to_arrays(data, columns, dtype=dtype)
File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 461, in to_arrays
return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 500, in _list_to_arrays
raise ValueError(e) from e
ValueError: 3 columns passed, passed data had 4 columns
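For contrast, here is a sketch (reusing the RGB and RGBA tuples from above) showing that the Option 1 approach copes with the same mixed data: converting each namedtuple to a dict before building the DataFrame avoids the assertion, and the missing alpha values just become NaN.

# Convert each namedtuple to a dict first, as in Option 1:
d_fixed = DataFrame([r._asdict() for r in [RGB(1,2,3), RGB(4,5,6), RGBA(7,8,9,0)]])
print(d_fixed)
# The frame has columns red, green, blue, alpha; alpha is NaN for the RGB rows and 0.0 for the RGBA row.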