PushShift: Scrape Submissions from timeframe

Question

I am trying to scrape submissions from WBS containing the TSLA ticker. I have the below code which is intended to take the top 25 submissions for each hour in the timeframe. I had a similar code for comments which worked really well for me, but now I can't figure out why my code is not working for submissions. I changed the base_url (I left in some of the urls I tried to use) and also change 'body' to 'selftext' in my code.

The error given is: ValueError: arrays must all be same length I will post the entire traceback if it helps.

year=2020

month = 6

start_date =1

days = 2

subreddit = "wallstreetbets"



def number_of_days_in_month(year=2020, month=6):

return monthrange(year, month)[1]

if month > 12 or days > number_of_days_in_month(year, month):

raise Exception()

submission_urls = []

# sorted by score

base_url = "https://api.pushshift.io/reddit/submission/search/?sort=desc&sort_type=num_comments&size=25&subreddit={}
"

#base_url = "https://api.pushshift.io/reddit/submission/search?limit=25&sort_type=score&sort=desc&subreddit={}
"

#base_url = "https://api.pushshift.io/reddit/search/submission/?selftext=TSLA
"

#base_url = "https://api.pushshift.io/reddit/submission/search?limit=25&sort_type=score&sort=desc&subreddit={}
"

#base_url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=WallStreetBets&after=2d&before=1d&q
=TSLA'



def downloadsubmissionsFromUrl(base_url):

count = 0

submission_temp = {"id":[], "author":[], "selftext":[], "created_utc":[], "permalink":[]}

for j in range(start_date,days+1):

# for every hour in a day

for i in range(0,23,2): # 2 hour steps

count += 1

current_day_start = int(dt.datetime(year,month, j, i, 0).timestamp())

current_day_end = int(dt.datetime(year,month, j, i, 59,59).timestamp())

url = base_url + f"&after={current_day_start}&before={current_day_end}"

new_url = url.format(subreddit)

print(new_url)

issue = False

try:

json = requests.get(new_url, timeout=5)

except:

json = None

issue = True

print("request failed - skipping")

try:

json_data = json.json()

except:

print(json)

# i guess if it didnt fail above?

else:



if 'data' not in json_data:

issue = True

objects = json_data['data']

if len(objects) == 0:

issue = True

if not issue:

for submission in objects:

try:

submission_temp["id"].append(submission['id'])

submission_temp["author"].append(submission['author'])

submission_temp["selftext"].append(submission['selftext'])

submission_temp["created_utc"].append(time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(submission['created_utc'])))

submission_temp["permalink"].append(f"https://reddit.com{submission['permalink']}")



except Exception as err:

print(traceback.format_exc())

else:

print("JSON request failed")

time.sleep(5)

return pd.DataFrame(submission_temp)

submissions = downloadsubmissionsFromUrl(base_url)

score -2 · Answer 1 · edited Aug 04 '21 at 06:26

-2

I had the same error and what helped me was to check my arrays (as per your error:

ValueError: arrays must all be same length

My guess is your arrays are:

submission_temp["id"]
submission_temp["author"]
etc.

You can try to print them to verify this. From my understanding DataFrames can be visualized as Excel Tables and each column is built from your arrays. Hence if you have arrays of different lengths, it would be as if you have a column with 5 rows and another one with 7 rows and the DataFrame constructor does not know how to interpret the missing rows. From what I read in this post, here is what you can try:

df = pd.DataFrame.from_dict(dictionary_name, orient='index')
df = df.transpose()

Although in my case I did not always need the second line.

edited Aug 04 '21 at 06:26

4b0

21,981
30
95
142

answered Aug 03 '21 at 21:28

gael1130

33
1
7

2

A link to a solution is welcome, but please ensure your answer is useful without it: [add context around the link](//meta.stackexchange.com/a/8259) so your fellow users will have some idea what it is and why it’s there, then quote the most relevant part of the page you're linking to in case the target page is unavailable. [Answers that are little more than a link may be deleted.](/help/deleted-answers) – Adrian Mole Aug 03 '21 at 21:29
Thank you for your comments, I did my best but I am still new at this. I won't do it again – gael1130 Aug 03 '21 at 22:06
Now I edited my answer, I hope it is good now. Let me know if you have any tips – gael1130 Aug 03 '21 at 23:43

PushShift: Scrape Submissions from timeframe

1 Answers1