
I am new to Google Cloud DLP. I sent a POST request to https://dlp.googleapis.com/v2beta1/inspect/operations to scan a .parquet file in a Google Cloud Storage directory, and I also used cloudStorageOptions to save the .csv output.

The .parquet file is 53.93 MB.

When I make the API call on the .parquet file, I get:

"processedBytes": "102308122",
"infoTypeStats": [{
   "infoType": {
      "name": "AMERICAN_BANKERS_CUSIP_ID"
   },
   "count": "1"
}, {
   "infoType": {
      "name": "IP_ADDRESS"
   },
   "count": "17"
}, {
   "infoType": {
      "name": "US_TOLLFREE_PHONE_NUMBER"
   },
   "count": "148"
}, {
   "infoType": {
      "name": "EMAIL_ADDRESS"
   },
   "count": "30"
}, {
   "infoType": {
      "name": "US_STATE"
   },
   "count": "22"
}]

When I convert the .parquet file to .csv, I get a 360.58 MB file. If I then make the same API call on the .csv file, I get:

"processedBytes": "377530307",
"infoTypeStats": [{
   "infoType": {
      "name": "CREDIT_CARD_NUMBER"
   },
   "count": "56546"
}, {
   "infoType": {
      "name": "EMAIL_ADDRESS"
   },
   "count": "372527"
}, {
   "infoType": {
      "name": "NETHERLANDS_BSN_NUMBER"
   },
   "count": "5"
}, {
   "infoType": {
      "name": "US_TOLLFREE_PHONE_NUMBER"
   },
   "count": "1331321"
}, {
   "infoType": {
      "name": "AUSTRALIA_TAX_FILE_NUMBER"
   },
   "count": "52269"
}, {
   "infoType": {
      "name": "PHONE_NUMBER"
   },
   "count": "28"
}, {
   "infoType": {
      "name": "US_DRIVERS_LICENSE_NUMBER"
   },
   "count": "114"
}, {
   "infoType": {
      "name": "US_STATE"
   },
   "count": "141383"
}, {
   "infoType": {
      "name": "KOREA_RRN"
   },
   "count": "56144"
}],

Clearly, the scan of the .parquet file detects far fewer infoTypes and findings than the scan of the .csv file, where I verified that all email addresses were detected.
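To make the gap concrete, here is a minimal sketch that compares the two results side by side. The counts are copied from the two responses above; the function and variable names are my own and are not part of the DLP API.

```python
# Counts copied from the infoTypeStats of the .parquet scan.
parquet_counts = {
    "AMERICAN_BANKERS_CUSIP_ID": 1,
    "IP_ADDRESS": 17,
    "US_TOLLFREE_PHONE_NUMBER": 148,
    "EMAIL_ADDRESS": 30,
    "US_STATE": 22,
}

# Counts copied from the infoTypeStats of the .csv scan.
csv_counts = {
    "CREDIT_CARD_NUMBER": 56546,
    "EMAIL_ADDRESS": 372527,
    "NETHERLANDS_BSN_NUMBER": 5,
    "US_TOLLFREE_PHONE_NUMBER": 1331321,
    "AUSTRALIA_TAX_FILE_NUMBER": 52269,
    "PHONE_NUMBER": 28,
    "US_DRIVERS_LICENSE_NUMBER": 114,
    "US_STATE": 141383,
    "KOREA_RRN": 56144,
}


def compare_counts(a, b):
    """Return {infoType: (count_in_a, count_in_b)} for every infoType in either scan."""
    return {name: (a.get(name, 0), b.get(name, 0)) for name in sorted(set(a) | set(b))}


def missing_from(a, b):
    """Return the infoTypes reported in scan b but absent from scan a."""
    return sorted(set(b) - set(a))


for name, (in_parquet, in_csv) in compare_counts(parquet_counts, csv_counts).items():
    print(f"{name:30} parquet={in_parquet:>8} csv={in_csv:>8}")

print("Missing from the .parquet scan:", missing_from(parquet_counts, csv_counts))
```

Six of the nine infoTypes found in the .csv scan do not appear at all in the .parquet scan, and the three shared ones have much lower counts.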

I couldn't find any documentation on scanning compressed formats such as Parquet, so I assume that Google Cloud DLP doesn't support them.

Any help would be greatly appreciated.

Dan Cornilescu
  • A little unclear what you're asking. Try to be more specific as to what you would like answered. – lloyd Sep 01 '17 at 23:04
  • My question is: how do I scan .parquet files in Google Cloud Storage using DLP (Data Loss Prevention)? I provided the output from scanning the .parquet file, and then the output from scanning the same file converted to .csv, to show the inconsistencies. – Kenzie Tahiri Sep 01 '17 at 23:46

1 Answer


Parquet files are currently scanned as binary objects, because the system does not yet parse their structure. The file types supported by the V2 API are listed here: https://cloud.google.com/dlp/docs/reference/rpc/google.privacy.dlp.v2#filetype
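Until Parquet is parsed natively, one workaround is the one the question already uses: convert the file to .csv and scan that instead. As a rough sketch, a V2 inspect job request body for the converted file might be built like this. The bucket path and the helper name are hypothetical, and I am assuming the `TEXT_FILE` file type is the right choice for .csv files; check the file type list linked above before relying on it.

```python
import json

# V2 endpoint for creating a DLP job; {project} is a placeholder for your project ID.
DLP_JOBS_URL = "https://dlp.googleapis.com/v2/projects/{project}/dlpJobs"


def build_inspect_job_body(gcs_url, info_type_names):
    """Build a request body for a Cloud Storage inspect job (hypothetical helper)."""
    return {
        "inspectJob": {
            "storageConfig": {
                "cloudStorageOptions": {
                    "fileSet": {"url": gcs_url},
                    # Assumption: scan the converted file as text rather than binary.
                    "fileTypes": ["TEXT_FILE"],
                }
            },
            "inspectConfig": {
                "infoTypes": [{"name": name} for name in info_type_names],
            },
        }
    }


# Hypothetical bucket and object name for the converted file.
body = build_inspect_job_body(
    "gs://example-bucket/export/data.csv",
    ["EMAIL_ADDRESS", "CREDIT_CARD_NUMBER"],
)
print(json.dumps(body, indent=2))
```

The sketch only builds and prints the JSON; sending it requires an authenticated POST to the jobs endpoint for your project.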

Jordanna Chord