
I am very new to GCP and was not sure if Cloud Functions is the way to go for this.

  1. I have a Python script which makes a call to the Twitter API using tweepy and generates a CSV file with a list of tweets for that particular username:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import tweepy
import datetime
import csv

def fetchTweets(username):
    # credentials from https://apps.twitter.com/
    consumerKey = ""  # hidden for security reasons
    consumerSecret = ""  # hidden for security reasons
    accessToken = ""  # hidden for security reasons
    accessTokenSecret = ""  # hidden for security reasons

    auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
    auth.set_access_token(accessToken, accessTokenSecret)

    api = tweepy.API(auth)

    startDate = datetime.datetime(2019, 1, 1, 0, 0, 0)
    endDate = datetime.datetime.now()
    print(endDate)

    tweets = []
    tmpTweets = api.user_timeline(screen_name=username)

    for tweet in tmpTweets:
        if startDate < tweet.created_at < endDate:
            tweets.append(tweet)

    # page backwards through the timeline until we reach startDate
    lastid = None
    while tmpTweets and tmpTweets[-1].created_at > startDate and tmpTweets[-1].id != lastid:
        print("Last tweet @", tmpTweets[-1].created_at, " - fetching some more")
        lastid = tmpTweets[-1].id
        # max_id is inclusive, so subtract 1 to avoid re-fetching (and
        # duplicating) the oldest tweet of the previous page
        tmpTweets = api.user_timeline(screen_name=username, max_id=lastid - 1)
        for tweet in tmpTweets:
            if startDate < tweet.created_at < endDate:
                tweets.append(tweet)

    # transform the tweepy tweets into a 2D array that will populate the csv
    # (keep the text as str: encoding it to bytes would store a b'...' repr)
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text] for tweet in tweets]

    # write the csv
    filename = '%s_tweets.csv' % username
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created", "text"])
        writer.writerows(outtweets)

    # read the file back and return its contents
    with open(filename, 'r', encoding='utf-8') as f:
        contents = f.read()
    return contents

fetchTweets('usernameofusertoretrieve')  # this will be set manually in production
  2. I wanted to run this script and retrieve the results (either as the CSV file or as the returned contents) over an HTTP request, e.g. using JavaScript. The script only needs to run once a day, but the generated data (the CSV) should be available as required.

My questions therefore are:

a. Is GCP Cloud Functions the correct tool for the job, or will this require something more extensive and therefore a GCP VM instance?

b. What would need to be changed in the code to make it run on GCP?

Any help/advice about the direction is also appreciated.

Coola
  • This is a pretty broad question. Cloud Functions provides a compute framework that scales to 0 and satisfies REST requests. Cloud Functions does not have persistent storage, so one would have to use a database or Cloud Storage. One possibility would be to run the Cloud Function as a scheduled job once a day that results in a CSV stored in a GCS bucket, and then the requestor would retrieve the content of the file directly. Basically, one Cloud Function call retrieves your data from Twitter and creates the GCS file, and everything else is just retrieval of that file. – Kolban Jan 21 '20 at 04:32
  • Thank you so much for a detailed comment. It really helped me. I did some more reading and came to the same solution of using a GCS bucket. – Coola Jan 21 '20 at 16:06

1 Answer


Your questions aren't easy to answer without more detail, but I will try to provide some insight.

is GCP Cloud Functions the correct tool for the job? or will this require something more extensive and therefore a GCP VM instance?

It depends. Will your processing take less than 9 minutes with 1 vCPU? And will your process use less than 2 GB of memory (app memory footprint + file size + tweets array size)?

Why does the file size matter? Because only the /tmp directory is writable, and it's an in-memory file system.

If you need a timeout of up to 15 minutes, you can have a look at Cloud Run, which is very similar to Cloud Functions and which I personally prefer. The CPU and memory limits are the same for Cloud Functions and Cloud Run (though this should change in 2020, with more CPU and memory becoming available).

What would need to be changed in the code to make it run on GCP?

Start by writing and reading to/from the /tmp directory. Then, if you want the file to be available all day long, store it in Cloud Storage (https://cloud.google.com/storage/docs) and try to retrieve it at the beginning of the function: if it doesn't exist, generate it for the current day; otherwise, return the existing one.
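As an illustration, here is a minimal sketch of that pattern. The bucket name my-tweets-bucket and the helper name getOrGenerateCsv are assumptions for the example; it uses the google-cloud-storage client library and supposes fetchTweets has been adapted to write its CSV under /tmp:

import datetime
from google.cloud import storage

BUCKET_NAME = "my-tweets-bucket"  # hypothetical bucket name

def getOrGenerateCsv(username):
    client = storage.Client()
    bucket = client.bucket(BUCKET_NAME)
    # one object per user and per day, so a fresh file is generated daily
    blob_name = "%s_tweets_%s.csv" % (username, datetime.date.today().isoformat())
    blob = bucket.blob(blob_name)

    if blob.exists():
        # today's file already exists: serve it without calling Twitter again
        return blob.download_as_text()

    # otherwise generate it (fetchTweets writes to /tmp, the only writable
    # directory on Cloud Functions) and upload it for later requests
    contents = fetchTweets(username)
    blob.upload_from_filename("/tmp/%s_tweets.csv" % username)
    return contents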

Then, replace the signature of the function, def fetchTweets(username):, with def fetchTweets(request): and get the username from the request parameters.
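For example, the HTTP entry point could look like this minimal sketch (the username query parameter name and the getOrGenerateCsv helper from above are assumptions):

def fetchTweets(request):
    # Cloud Functions passes a Flask request object to HTTP functions
    username = request.args.get("username", "usernameofusertoretrieve")
    contents = getOrGenerateCsv(username)  # hypothetical helper sketched above
    # return the CSV content with an appropriate content type
    return contents, 200, {"Content-Type": "text/csv"}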

Finally, set up a Cloud Scheduler job if you want the file generated every day.
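The Scheduler job can be created from the console or the gcloud CLI; as a hedged sketch, it can also be done with the google-cloud-scheduler client library (the project, region, schedule, and function URL below are all placeholders):

from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = client.common_location_path("my-project", "us-central1")

job = scheduler_v1.Job(
    name=parent + "/jobs/fetch-tweets-daily",
    schedule="0 6 * * *",  # every day at 06:00
    time_zone="Etc/UTC",
    http_target=scheduler_v1.HttpTarget(
        # placeholder function URL; pass the username as a query parameter
        uri="https://us-central1-my-project.cloudfunctions.net/fetchTweets?username=someuser",
        http_method=scheduler_v1.HttpMethod.GET,
    ),
)
client.create_job(parent=parent, job=job)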


You didn't mention security. I recommend deploying your function in private mode.

There are a lot of GCP serverless concepts in this answer, and I don't know your level of knowledge of GCP. If you want more detail on some parts, don't hesitate to ask!

guillaume blaquiere
  • Thank you so much for taking the time to respond in such detail. I do not think that the function is very intensive (in either CPU or disk space), as the Twitter API restricts the data to 3,200 tweets, so the file produced is only a few hundred kB. Thanks also for pointing out the security aspect, which I completely overlooked. I was able to implement a Cloud Scheduler job and give it (and only it) access to invoke the HTTP trigger. If you could highlight a better way then please feel free to elaborate your answer. Like I said, I am a beginner and very new to GCP. – Coola Jan 21 '20 at 16:11
  • You can look at this answer: https://stackoverflow.com/questions/59825183/restrict-access-to-google-cloud-function/59838643#59838643 – guillaume blaquiere Jan 21 '20 at 16:33