
I'm scanning a nested directory in a Cloud Storage bucket. The result doesn't contain the matched value (quote) even though I have include_quote set to True. Also, how do I get the names of the files that contain matches, along with the matched values? I'm using Python. This is what I have so far. As you can see, the API found matches, but I'm not getting details on which words (and which files) were flagged.

inspect_job = {
    'inspect_config': {
        'info_types': info_types,
        'min_likelihood': MIN_LIKELIHOOD,
        'include_quote': True,
        'limits': {
            'max_findings_per_request': MAX_FINDINGS
        },
    },
    'storage_config': {
        'cloud_storage_options': {
            'file_set': {
                'url':
                    'gs://{bucket_name}/{dir_name}/**'.format(
                        bucket_name=STAGING_BUCKET, dir_name=DIR_NAME)
            }
        }
    }
}

operation = dlp.create_dlp_job(parent, inspect_job)
dlp.get_dlp_job(operation.name)

Here is the result:

result {
  processed_bytes: 64
  total_estimated_bytes: 64
  info_type_stats {
    info_type {
      name: "EMAIL_ADDRESS"
    }
    count: 1
  }
  info_type_stats {
    info_type {
      name: "PHONE_NUMBER"
    }
    count: 1
  }
  info_type_stats {
    info_type {
      name: "FIRST_NAME"
    }
    count: 2
  }
}

3 Answers


You need to follow the "Retrieving inspection results" section in https://cloud.google.com/dlp/docs/inspecting-storage and specify a save-findings action: https://cloud.google.com/dlp/docs/reference/rest/v2/InspectJobConfig#SaveFindings
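For instance, a minimal sketch of such an action added to the question's inspect_job dict (the BigQuery project, dataset, and table IDs below are placeholders, not values from this thread):

# Hedged sketch: a SaveFindings action that persists each finding
# (including its quote and source file) to a BigQuery table.
# All three IDs are placeholders; substitute your own.
inspect_job['actions'] = [
    {
        'save_findings': {
            'output_config': {
                'table': {
                    'project_id': 'your-project-id',
                    'dataset_id': 'your_dataset',
                    'table_id': 'your_findings_table'
                }
            }
        }
    }
]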


I think you're not getting the quote value because your inspectConfig is not quite right. According to the docs at https://cloud.google.com/dlp/docs/reference/rest/v2/InspectConfig, you should be setting

  "includeQuote": true 

Edit: adding info about getting file names, following this example: https://cloud.google.com/solutions/automating-classification-of-data-uploaded-to-cloud-storage

The code for the Cloud Function resolve_DLP gets the file name from the job details like this:

def resolve_DLP(data, context):
    ...
    job = dlp.get_dlp_job(job_name)
    ...
    file_path = (
        job.inspect_details.requested_options.job_config.storage_config
        .cloud_storage_options.file_set.url)
    file_name = os.path.basename(file_path)
    ...

Edit 2: now I see that the latest Python API client uses 'include_quote' as the dict key, so that's not it...

Edit 3: From the protobuf definition bundled with the Python API code:

message Finding {
  // The content that was found. Even if the content is not textual, it
  // may be converted to a textual representation here.
  // Provided if `include_quote` is true and the finding is
  // less than or equal to 4096 bytes long. If the finding exceeds 4096 bytes
  // in length, the quote may be omitted.
  string quote = 1;

So maybe smaller files will yield the quotes.


Rondo, thanks for your input. I believe the Cloud Storage example you mentioned only scans one file per job, and it doesn't use the SaveFindings object.

Josh, you are right. It seems one needs to direct the output to BigQuery or Pub/Sub to see the complete result.

From https://cloud.google.com/dlp/docs/inspecting-storage#retrieving-inspection-results:

For complete inspection job results, you have two options. Depending on the Action you've chosen, inspection jobs are:

- Saved to BigQuery (the SaveFindings object) in the table specified. Before viewing or analyzing the results, first ensure that the job has completed by using the projects.dlpJobs.get method, which is described below. Note that you can specify a schema for storing findings using the OutputSchema object.
- Published to a Cloud Pub/Sub topic (the PublishToPubSub object). The topic must have given publishing access rights to the Cloud DLP service account that runs the DlpJob sending the notifications.
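For the BigQuery option, that means polling the job until it reaches DONE before reading the output table. A minimal polling sketch, assuming the pre-2.0 google-cloud-dlp client used elsewhere in this thread (the enums import path is specific to that client generation):

import time

import google.cloud.dlp
from google.cloud.dlp_v2 import enums

dlp = google.cloud.dlp.DlpServiceClient()

def wait_for_dlp_job(job_name, poll_seconds=10):
    # Poll the job until it reaches a terminal state; findings are only
    # guaranteed to be in the BigQuery table once the job is DONE.
    while True:
        job = dlp.get_dlp_job(job_name)
        if job.state in (enums.DlpJob.JobState.DONE,
                         enums.DlpJob.JobState.FAILED,
                         enums.DlpJob.JobState.CANCELED):
            return job
        time.sleep(poll_seconds)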

I got it working by modifying the solution in "How to scan BigQuery table with DLP looking for sensitive data?".

Here is my final working script:

import google.cloud.dlp

dlp = google.cloud.dlp.DlpServiceClient()

inspect_job_data = {
    'storage_config': {
        'cloud_storage_options': {
            'file_set': {
                'url':
                    'gs://{bucket_name}/{dir_name}/**'.format(
                        bucket_name=STAGING_BUCKET, dir_name=DIR_NAME)
            }
        }
    },
    'inspect_config': {
        'include_quote': True,
        'info_types': [
            {'name': 'ALL_BASIC'},
        ],
    },
    # Without this action the job only reports per-infoType counts;
    # save_findings writes each finding (quote, file, etc.) to BigQuery.
    'actions': [
        {
            'save_findings': {
                'output_config': {
                    'table': {
                        'project_id': GCP_PROJECT_ID,
                        'dataset_id': DATASET_ID,
                        'table_id': '{}_DLP'.format(TABLE_ID)
                    }
                }
            },
        },
    ]
}

operation = dlp.create_dlp_job(parent=dlp.project_path(GCP_PROJECT_ID),
                               inspect_job=inspect_job_data)
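
Once the job completes, each row in the findings table carries the quote plus, nested under location.content_locations, the gs:// path of the file it came from. A hedged sketch of reading those back with the BigQuery client (column names follow the default DLP findings output schema; verify against your table if you specified a custom OutputSchema):

from google.cloud import bigquery

bq = bigquery.Client(project=GCP_PROJECT_ID)

# The source file path lives in the repeated location.content_locations
# record; UNNEST it to get one row per (finding, container) pair.
query = """
    SELECT
      quote,
      info_type.name AS info_type,
      cl.container_name AS file_name
    FROM `{project}.{dataset}.{table}_DLP`,
      UNNEST(location.content_locations) AS cl
""".format(project=GCP_PROJECT_ID, dataset=DATASET_ID, table=TABLE_ID)

for row in bq.query(query):
    print(row.file_name, row.info_type, row.quote)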