Questions tagged [google-cloud-dataprep]

An intelligent cloud data service to visually explore, clean, and prepare data for analysis.

Dataprep (more precisely, Cloud Dataprep by Trifacta) is a visual data transformation tool built by Trifacta and offered as part of Google Cloud Platform.

It can ingest data from, and write data to, several other Google Cloud services (BigQuery, Cloud Storage).

Data is transformed using recipes, which are shown alongside a visual representation of the data. This allows the user to preview changes, profile columns, and spot outliers and type mismatches.

When a Dataprep flow is run (either manually or on a schedule), a Dataflow job is created to execute it. Dataflow is Google's managed Apache Beam service.

205 questions
9 votes, 1 answer

Can Google Data Fusion do the same data cleaning as Dataprep?

I want to run a machine learning model with some data. Before training the model with this data I need to process it, so I have been reading about some ways to do it. First of all, create a Dataflow pipeline to upload it to BigQuery or Google Cloud Storage,…
8 votes, 3 answers

Dataprep vs Dataflow vs Dataproc

To perform source data preparation, data transformation or data cleansing, in what scenario should we use Dataprep vs Dataflow vs Dataproc?
6 votes, 2 answers

Can Google Cloud Dataprep monitor a GCS path for new files?

Google Cloud Dataprep seems great and we've used it to manually import static datasets, however I would like to execute it more than once so that it can consume new files uploaded to a GCS path. I can see that you can set up a schedule for Dataprep,…
Matt Byrne (4,908)
5 votes, 0 answers

Job Fails with odd message

I have a job that is failing at the very start with the message: "@*" and "@N" are reserved sharding specs. Filepattern must not contain any of them. I have altered the destination location to be something other than the default (an email address)…
williamvicary (805)
4 votes, 1 answer

Executing a Dataflow job with multiple inputs/outputs using gcloud cli

I've designed a data transformation in Dataprep and am now attempting to run it by using the template in Dataflow. My flow has several inputs and outputs - the dataflow template provides them as a json object with key/value pairs for each input &…
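A minimal sketch of what such an invocation might look like with the gcloud CLI. The bucket paths, job name, and location keys below are hypothetical placeholders; the real location ids come from the metadata file that Dataprep exports alongside the template:

```shell
# All paths, ids, and the job name below are hypothetical placeholders.
TEMPLATE="gs://example-bucket/dataprep/templates/my_flow"

# Dataprep-exported templates take their inputs and outputs as JSON maps,
# keyed by the location ids listed in the template's metadata file.
PARAMS='inputLocations={"location1":"gs://example-bucket/input.csv"},outputLocations={"location1":"gs://example-bucket/output"}'

# Note: gcloud splits --parameters on commas, so JSON values containing
# commas may need gcloud's alternate-delimiter syntax (e.g. ^;^key=value).
CMD="gcloud dataflow jobs run my-dataprep-job --gcs-location $TEMPLATE --parameters $PARAMS"
echo "$CMD"   # inspect the command, then run it once gcloud is configured for your project
```

This only prints the assembled command; running it requires an authenticated gcloud installation and a project with the Dataflow API enabled.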
4 votes, 1 answer

Google Cloud Dataprep Import Recipes

I can see it's possible to download a recipe, but I can't see any option to import it. Does anyone know if this option exists?
4 votes, 1 answer

Dataprep - Scheduling Jobs

To anyone on the Dataprep beta: is it possible to schedule jobs to run? If so, is it via the App Engine cron service? I can't quite follow the cron for App Engine instructions, but want to make sure it's not a dead end before I try. Thanks
Aaron Harris (415)
3 votes, 2 answers

How do I run Google Dataprep jobs automatically?

Is there a way to trigger a Google Dataprep flow over the API? I need to run around 30 different flows every day. Every day the source dataset changes, and the result has to be appended to a Google BigQuery table. Is there a way to automate this process?…
stkvtflw (12,092)
3 votes, 1 answer

How to use Google Data Prep API using Python

Google just launched the new API. Link is here. I want to know what the host is in this case, as they are using example.com and port 3005. I am also following this article, but it does not provide example code.
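For reference, a minimal sketch of calling the hosted Dataprep API from Python using only the standard library. The POST /v4/jobGroups path is the job-run entry point in Trifacta's API; the host, access token, and dataset id below are placeholder assumptions to be replaced with your own values:

```python
import json
from urllib.request import Request

# Placeholder assumptions: substitute your own host, token, and dataset id.
API_BASE = "https://api.clouddataprep.com"   # hosted Dataprep endpoint
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"           # generated in Dataprep settings
WRANGLED_DATASET_ID = 1234                   # the recipe's wrangled dataset id

# Running a job is a POST to /v4/jobGroups with the wrangled dataset id.
payload = {"wrangledDataset": {"id": WRANGLED_DATASET_ID}}
req = Request(
    f"{API_BASE}/v4/jobGroups",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# urllib.request.urlopen(req) would actually submit the job; omitted here
# so the sketch runs without credentials or network access.
print(req.full_url, req.get_method())
```

Building the request separately from sending it makes the payload easy to inspect or log before the job is actually submitted.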
3 votes, 2 answers

Add dataset parameters into column to use them in BigQuery later with DataPrep

I am importing several files from Google Cloud Storage (GCS) through Google DataPrep and store the results in tables of Google BigQuery. The structure on GCS looks something like…
WJA (6,676)
3 votes, 1 answer

How do I chain multiple Google Cloud DataPrep flows?

I've created two Flows in Cloud DataPrep - the first outputs to a BigQuery table and also creates a reference dataset. The second flow takes the reference dataset and processes it further before outputting to a second BigQuery table. Is it possible…
angusham (98)
3 votes, 2 answers

Dataprep: creating a column that converts to a BigQuery timestamp type

I have been trying like crazy to create a column from an existing Datetime column type that would "publish" to a BigQuery "timestamp" column. I have tried every permutation of Dataprep's "unixtime" and "unixtimeformat" functions to…
jldupont (93,734)
3 votes, 1 answer

DataPrep: access to source filename

Is there a way to create a column with the filename of the source that created each row? Use case: I would like to track which file in a GCS bucket resulted in the creation of which row in the resulting dataset. I would like a scheduled transformation…
jldupont (93,734)
3 votes, 1 answer

Google Dataprep - replace data in columns

I have started to use Google's Dataprep solution to cleanse eCommerce product feeds. As I receive data from 100s of eCommerce stores, I want to cleanse the data for consistency and rename the various spellings of brand names. For example, I have a…
3 votes, 1 answer

How to export file with headers in Google Dataprep?

I am trying to export the results of a Google Dataprep job. As you can see in the following screenshot, the columns have names or headers; however, the exported file does not include them. How can I keep those column headers in the exported CSV…
Milton (891)