
I have a web app (react + node.js) running on App Engine.
I would like to kick off (from this web app) a Machine Learning job that requires a GPU (running in a container on AI platform or running on GKE using a GPU node pool like in this tutorial, but we are open to other solutions).
I was thinking of trying what is described at the end of this answer, basically making an HTTP request to start the job using the projects.jobs.create API.

More details on the ML job in case this is useful: it generates an output every second that is stored on Cloud Storage and then read in the web app.

I am looking for examples of how to set this up. Where would the job configuration live, and how should I set up the API call to kick off that job? Are there other ways to achieve the same result?

Thank you in advance!

3 Answers


On Google Cloud, everything is an API, and you can interact with every product through HTTP requests. So you can definitely achieve what you want.

I don't personally have an example, but you have to build a JSON job description and POST it to the API.

Don't forget: when you interact with Google Cloud APIs, you have to add an access token in the Authorization: Bearer header.
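For instance, building those headers might look like this (a minimal sketch; the token value is a placeholder, and in practice you would obtain a real one from your service account credentials or `gcloud auth print-access-token`):

```javascript
// Sketch: headers for an authenticated Google Cloud REST call.
// ACCESS_TOKEN is a placeholder -- substitute a real token obtained
// from your service account or `gcloud auth print-access-token`.
const ACCESS_TOKEN = 'ya29.your-access-token';

function buildHeaders(token) {
  return {
    'Authorization': `Bearer ${token}`,
    'Content-Type': 'application/json',
  };
}

const headers = buildHeaders(ACCESS_TOKEN);
console.log(headers.Authorization); // Bearer ya29.your-access-token
```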


Where should your job config description live? It depends...

If it is strongly related to your App Engine app, you can add it to the App Engine code itself and have it "hard coded". The downside of that option is that any time you have to update the configuration, you have to deploy a new App Engine version. But if the new version isn't correct, a rollback to a previous, stable version is easy and consistent.

If you prefer to update your config file and your App Engine code independently, you can store the config outside the App Engine code, on Cloud Storage for instance. That way, updates are simple and easy: change the file on Cloud Storage to change the job configuration. However, there is then no relation between the App Engine version and the config version, and a rollback to a stable version can be more difficult.

You can also combine both: keep a default job configuration in your App Engine code, plus an environment variable that can be set to point to a Cloud Storage file containing a newer version of the configuration.
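That combination could be sketched as follows. This is only an illustration: the environment variable name `JOB_CONFIG_GCS_URI` is an assumption, not a GCP convention, and the Cloud Storage download itself is injected as a callback (in the real app it would use the `@google-cloud/storage` client):

```javascript
// Sketch: fall back to a hard-coded default job config unless an
// environment variable points at a config file on Cloud Storage.
// JOB_CONFIG_GCS_URI is an assumed variable name for illustration.
const DEFAULT_JOB_CONFIG = {
  scaleTier: 'CUSTOM',
  masterType: 'standard',
  region: 'us-central1',
};

// `fetchFromGcs` is injected so the selection logic stays testable;
// in the app it would download the object with @google-cloud/storage.
async function loadJobConfig(env, fetchFromGcs) {
  const uri = env.JOB_CONFIG_GCS_URI;
  if (uri) {
    return JSON.parse(await fetchFromGcs(uri));
  }
  return DEFAULT_JOB_CONFIG;
}
```

With the variable unset, the config stays in lockstep with the App Engine version; setting it switches to the Cloud Storage copy without a redeploy.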

I don't know if it answers all your questions. Don't hesitate to comment if you want more details on some parts.

guillaume blaquiere

As mentioned, you can use the AI Platform API to create a job via a POST request. Below is an example using JavaScript and the request library to trigger a job. Some useful tips:


  • Use the Jobs console to create a job manually, then use the API to list that job; you will then have a perfect JSON example of how to trigger it.

  • You can use the Try this API tool to get the JSON output of the manually created job. Use this path to get the job: projects/<project name>/jobs/<job name>.

  • Get the authorization token using the OAuth 2.0 Playground for testing purposes (Step 2 -> Access token). Check the docs for the definitive way.
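Putting that job path in code (a small sketch; the project and job names are placeholders, and `ml.googleapis.com` is the documented AI Platform Training endpoint):

```javascript
// Sketch: the resource path used by projects.jobs.get in the
// "Try this API" tool, and the corresponding REST URL.
function jobPath(project, job) {
  return `projects/${project}/jobs/${job}`;
}

function jobGetUrl(project, job) {
  return `https://ml.googleapis.com/v1/${jobPath(project, job)}`;
}

console.log(jobGetUrl('my-project', 'my_job'));
// https://ml.googleapis.com/v1/projects/my-project/jobs/my_job
```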


Not all parameters are required in the JSON; that's just one example, from a job that I created and whose JSON I got using the steps above.

JS Example:

var request = require('request'); // npm install request

request({
  // AI Platform Training projects.jobs.create endpoint
  url: 'https://content-ml.googleapis.com/v1/projects/<project-name>/jobs?alt=json',
  method: 'POST',
  // Replace with a valid access token (see the tips above)
  headers: {"authorization": "Bearer ya29.A0AR9999999999999999999999999"},
  json: {
    "jobId": "<job name>",
    "trainingInput": {
      "scaleTier": "CUSTOM",
      "masterType": "standard",
      "workerType": "cloud_tpu",
      "workerCount": "1",
      "args": [
        "--training_data_path=gs://<bucket>/*.jpg",
        "--validation_data_path=gs://<bucket>/*.jpg",
        "--num_classes=2",
        "--max_steps=2",
        "--train_batch_size=64",
        "--num_eval_images=10",
        "--model_type=efficientnet-b0",
        "--label_smoothing=0.1",
        "--weight_decay=0.0001",
        "--warmup_learning_rate=0.0001",
        "--initial_learning_rate=0.0001",
        "--learning_rate_decay_type=cosine",
        "--optimizer_type=momentum",
        "--optimizer_arguments=momentum=0.9"
      ],
      "region": "us-central1",
      "jobDir": "gs://<bucket>",
      "masterConfig": {
        "imageUri": "gcr.io/cloud-ml-algos/image_classification:latest"
      }
    },
    "trainingOutput": {
      "consumedMLUnits": 1.59,
      "isBuiltInAlgorithmJob": true,
      "builtInAlgorithmOutput": {
        "framework": "TENSORFLOW",
        "runtimeVersion": "1.15",
        "pythonVersion": "3.7"
      }
    }
  }
}, function(error, response, body){
  console.log(body);
});

Result:

... 
{
  createTime: '2022-02-09T17:36:42Z',
  state: 'QUEUED',
  trainingOutput: {
    isBuiltInAlgorithmJob: true,
    builtInAlgorithmOutput: {
      framework: 'TENSORFLOW',
      runtimeVersion: '1.15',
      pythonVersion: '3.7'
    }
  },
  etag: '999999aaaac='
}
ewertonvsilva

Thank you everyone for the input. This was useful to help me resolve my issue, but I wanted to also share the approach I ended up taking:

I started by making sure I could kick off my job manually. I used this tutorial with a config.yaml file that looked like this:

workerPoolSpecs:
  machineSpec:
    machineType: n1-standard-4
    acceleratorType: NVIDIA_TESLA_T4
    acceleratorCount: 1
  replicaCount: 1
  containerSpec:
    imageUri: <Replace this with your container image URI>
    args: ["--some=argument"]

When I had a job that could be kicked off manually, I switched to using the Vertex AI Node.js API to start or cancel the job. The API exists in other languages as well. I know my original question was about HTTP requests, but having an API client in the language was a lot easier for me, in particular because I didn't have to worry about authentication.
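For reference, the same worker pool spec as in that config.yaml can be expressed as a request object for the Vertex AI Node.js client. This is a sketch, not tested against a real project: the project, region, job name, and image URI are placeholders, and the submit function assumes the `@google-cloud/aiplatform` package is installed and credentials are configured:

```javascript
// Sketch: the config.yaml worker pool spec as a CustomJob request.
// Project, location, display name, and image URI are placeholders.
const project = 'my-project';
const location = 'us-central1';

const customJob = {
  displayName: 'my-gpu-job',
  jobSpec: {
    workerPoolSpecs: [{
      machineSpec: {
        machineType: 'n1-standard-4',
        acceleratorType: 'NVIDIA_TESLA_T4',
        acceleratorCount: 1,
      },
      replicaCount: 1,
      containerSpec: {
        imageUri: 'gcr.io/my-project/my-ml-image:latest',
        args: ['--some=argument'],
      },
    }],
  },
};

// Not invoked here: submitting the job needs credentials and the
// @google-cloud/aiplatform package (assumed installed separately).
async function submitJob() {
  const { JobServiceClient } = require('@google-cloud/aiplatform');
  const client = new JobServiceClient({
    apiEndpoint: `${location}-aiplatform.googleapis.com`,
  });
  const [job] = await client.createCustomJob({
    parent: `projects/${project}/locations/${location}`,
    customJob,
  });
  return job.name;
}
```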

I hope that is useful; happy to provide more details if needed.