I'm trying to store some data I've scraped from an API in a DataFrame and then write it to a .csv file. This usually works, but the script sometimes breaks with this error message:

AssertionError: 16 columns passed, passed data had 17 columns

Anyone know what's going on here? Code is below -- it breaks after "pass one"

from psaw import PushshiftAPI
import datetime as dt
import pandas as pd

api = PushshiftAPI()
start_epoch=int(dt.datetime(2018, 6,2).timestamp())
end_epoch=int(dt.datetime(2018, 12, 31).timestamp())

subreddit = input('Which subreddit would you like to scrape? ')

submission_results = list(api.search_submissions(after=start_epoch,
                                                 before=end_epoch,
                                                 subreddit=subreddit,
                                                 filter=['id', 'title', 'subreddit', 'num_comments', 'score', 'author',
                                                         'is_original_content', 'is_self', 'stickied', 'selftext',
                                                         'created_utc', 'locked', 'over_18', 'permalink', 'upvote_ratio',
                                                         'url'],
                                                 limit=None))

print ('pass one')

submission_results_df = pd.DataFrame(submission_results)
print ('pass two')
# fillna returns a new DataFrame, so assign the result back
submission_results_df = submission_results_df.fillna('NULL')
print('pass three')
submission_results_df.to_csv('D:/CAMER/%s_Submittisons-%s-%s.csv' % (subreddit, start_epoch, end_epoch))
Z4-tier
  • To be able to answer this, it would be helpful to know: 1. the exact text of the error message and stack trace - as it is we can only _assume_ it occurs in `pd.DataFrame(submission_results)`, 2. what is in `submission_results`, both normally and specifically when the error occurs - this information is directly available to you but it's hard for us to guess without going and finding docs for `PushshiftAPI`. – Weeble Aug 21 '20 at 00:57
  • Code works for me. What subreddit are you using that is giving this error? @Weeble PushshiftAPI is available in pypi. – Z4-tier Aug 21 '20 at 01:06
  • I get this error on both r/petioles and r/trees (I'm on a research project about cannabis), but only for certain date ranges. As a novice, it seems like it might have something to do with missing values for certain submissions? – AnonymousCoward Aug 21 '20 at 01:34

1 Answer

I believe the most likely explanation is that the submissions returned from the query don't all have the same number of fields, and the way you are constructing the dataframe cannot handle this. I'm going to suggest two options to work around this, then I'll explain in more detail what I think is happening.

Option 1: convert to dicts

You could convert each namedtuple record into a dictionary. This should be safer because then pandas won't assume that every record has the same set of fields in the same order. If some records have an extra field then pandas will create a column for it and fill it with NaN for all the other records.

submission_results_df = pd.DataFrame(result._asdict() for result in submission_results)
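
As a minimal sketch of what this buys you, the example below uses two made-up namedtuple types (Short and Long are hypothetical stand-ins for psaw results, not real psaw classes) where one record carries an extra field. Because each dict brings its own keys, pandas should align the values by name and leave the gap as NaN.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-ins for psaw results: the second type has one extra field.
Short = namedtuple("Short", "id title score")
Long = namedtuple("Long", "id title score upvote_ratio")

records = [Short("a1", "first", 10), Long("b2", "second", 5, 0.9)]

# Each dict carries its own field names, so pandas aligns by key, not by position.
df = pd.DataFrame([r._asdict() for r in records])

print(df.columns.tolist())
print(df["upvote_ratio"].isna().tolist())
```

The first record has no upvote_ratio, so that cell ends up missing rather than raising an error.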

Option 2: use the psaw CLI instead

I note that the psaw library you are using has a command-line interface which can save directly to JSON or CSV. Perhaps this would avoid your difficulties if you are in fact only using pandas to convert the data to CSV.
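
If pandas is only there for the CSV conversion, the standard-library csv module is another route. To be clear, this is not the psaw CLI, just a rough sketch of the same idea in plain Python. The record types and the in-memory buffer are assumptions for illustration; in a real script you would pass a file opened with newline="" instead of the buffer.

```python
import csv
import io
from collections import namedtuple

# Hypothetical stand-ins for psaw results with differing field sets.
Short = namedtuple("Short", "id title score")
Long = namedtuple("Long", "id title score upvote_ratio")
records = [Short("a1", "first", 10), Long("b2", "second", 5, 0.9)]

rows = [r._asdict() for r in records]

# Union of all field names, kept in first-seen order.
fieldnames = list(dict.fromkeys(name for row in rows for name in row))

buffer = io.StringIO()  # stand-in for open(path, "w", newline="")
# restval fills in any field a row does not have.
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="NULL")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The restval argument plays the role that fillna('NULL') plays in the pandas version.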


Explanation

I haven't directly reproduced the problem using the data from Reddit, but I can explain what appears to be happening here. submission_results contains a list of namedtuples, created in _wrap_thing. (I previously misread the source code and thought these were instances of praw.models.reddit.submission, but that's only the case if you have provided a reddit API object during construction.)

The error message "AssertionError: 16 columns passed, passed data had 17 columns" appears to come from pandas' _validate_or_indexify_columns and indicates that it expected 16 columns but received data for 17 columns. I'm not 100% clear which code path it took to get there, but I include below an example that gets the same error using namedtuple.

I don't think it's a great idea to pass a list of objects directly into the DataFrame constructor. The constructor can interpret data in a number of different formats, including some that don't seem to be clearly documented. When it gets a list of namedtuples, it uses the first namedtuple to determine the field names and then converts each item into a list to extract the values. If that is what is happening, then at least one of the objects in your data has 17 fields instead of 16. I have no idea whether psaw guarantees that all objects will have the same number of fields, or even that the fields will appear in the same order when they are the same.
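
One way to check this guess against your own data is to count how many records share each field layout. The sketch below does that with made-up namedtuples (Short and Long are hypothetical, not psaw classes); with real results you would build the Counter from submission_results instead.

```python
from collections import Counter, namedtuple

# Hypothetical stand-ins for psaw results; one record has an extra field.
Short = namedtuple("Short", "id title score")
Long = namedtuple("Long", "id title score upvote_ratio")
records = [
    Short("a1", "first", 10),
    Short("c3", "third", 7),
    Long("b2", "second", 5, 0.9),
]

# Count how many records share each exact tuple of field names.
layouts = Counter(r._fields for r in records)
for fields, count in layouts.most_common():
    print(count, "record(s) with", len(fields), "fields:", fields)

# Fields that are missing from at least one layout.
all_fields = set().union(*layouts)
common_fields = set(all_fields).intersection(*layouts)
extra_fields = sorted(all_fields - common_fields)
print("not present everywhere:", extra_fields)
```

If more than one layout shows up, that would support the explanation above and point at the field responsible.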


Related reproduction of the same error message using namedtuple instead:

from collections import namedtuple
from pandas import DataFrame

RGB = namedtuple('RGB', 'red green blue')
RGBA = namedtuple('RGBA', 'red green blue alpha')

# This works:
d_okay = DataFrame([RGB(1,2,3),RGB(4,5,6)])

# This fails:
d_bad = DataFrame([RGB(1,2,3),RGB(4,5,6),RGBA(7,8,9,0)])
Traceback (most recent call last):
  File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 497, in _list_to_arrays
    content, columns, dtype=dtype, coerce_float=coerce_float
  File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 581, in _convert_object_array
    f"{len(columns)} columns passed, passed data had "
AssertionError: 3 columns passed, passed data had 4 columns

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "repro.py", line 11, in <module>
    d_bad = DataFrame([RGB(1,2,3),RGB(4,5,6),RGBA(7,8,9,0)])
  File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 474, in __init__
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 461, in to_arrays
    return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
  File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 500, in _list_to_arrays
    raise ValueError(e) from e
ValueError: 3 columns passed, passed data had 4 columns
Weeble