
I'm developing a pipeline in Data Fusion that must read a JSON file from Google Cloud Storage, transform some fields (delete or rename some of them), and then write the result to a BigQuery table.

I'm doing the transformation in Wrangler. My problem is that some columns have no value in the input JSON document, and I cannot get these fields to appear in the output schema. I have tried the "Keep column" directive without success.

Also, these empty columns show no type in Wrangler, even though the types are defined in the JSON schema that I import along with the input file.

How should I treat the empty fields so that they appear in the output schema?
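For reference, the kind of recipe I'm applying looks roughly like this. The column names here are just placeholders, I'm assuming the file arrives in a single body column that gets parsed first, and the exact directive syntax may differ by Wrangler version:

    parse-as-json :body 1
    rename :body_user_name :user_name
    drop :body_internal_ref
    set-type :body_amount long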

Thanks and regards

ars_nei

2 Answers


The error is telling you that you're trying to put an empty value into a column that expects an integer. This happens because the empty string ends up being used as the type in the mapping document. The fact that mapping to a non-empty field in the output schema doesn't raise an error doesn't mean there are no errors.

When exporting from Wrangler, you need to specify the type of each field, even when the fields are empty. For example, say I wanted to empty a field that was typed int64; I would do the following.

    # Empty the int64 fields: blank out every non-empty value so the field
    # is exported as an empty string.
    for (i in seq_along(x)) {
      for (j in seq_along(x[[i]])) {
        if (length(x[[i]][[j]]) > 0) {
          x[[i]][[j]] <- ""
        }
      }
    }

Note that I'm using length() here to show that it's really just a question of length: for an empty field, length(x[[i]][[j]]) == 0 is essentially the same check as x[[i]][[j]] == "".

You will want to be careful to get your values the way you want them. It may help to look at what I wrote in my blog post, which uses Wrangler as an example.

I've worked through your problem and see why you don't want to use "Empty" as your column type.

It looks like what you're trying to do is create a pipeline where there is an input schema that you can see, then output something to a new table with more columns.

But it doesn't look like you want to keep the column names for the columns you're not using - so you may want to change the type of your column in your source schema.

In Wrangler you can have two mappings: one that takes an empty value, and one that doesn't.

The empty-value case is easy: if you want a field that is typed as a string but has no value, just use the empty string "" as the value.
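If it helps, here is a rough sketch of that idea as Wrangler directives, assuming the fill-null-or-empty and set-type directives are available in your version and using made-up amount and notes columns; the point is just to give an otherwise-empty column an explicit value and type before the output schema is generated:

    fill-null-or-empty :amount '0'
    set-type :amount long
    fill-null-or-empty :notes 'n/a'
    set-type :notes string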

    Hi, not sure if I understand. Neither the pipeline nor Wrangler returns an error. In Wrangler I defined a schema with the type of each field, but the fields for which there is no value in the input file appear as "unknown" type. I tried changing their type manually in Wrangler, but when I apply the directives, these fields aren't propagated to the output schema. As a workaround, I have added a line to the input file with all the fields filled in. By doing so, all the fields have their type in Wrangler and are propagated to the output schema after applying the rest of the directives. – ars_nei Oct 20 '22 at 08:46

I have found 2 possible workarounds in case someone has the same issue:

1.: I have added a line to the input file with all the fields filled in. By doing so, all the fields have their type in Wrangler and are propagated to the output schema after applying the rest of the directives (sketched after this list).

2.: Export the pipeline and edit its JSON, adding the missing fields to the output schema of the Wrangler step (also sketched below).
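To make these concrete, here are rough sketches; all field names are made up. For workaround 1, the extra line I append to the input file is simply a record with every field populated, for example:

    {"id": 17, "amount": 0, "notes": "placeholder"}

For workaround 2, I edit the Wrangler stage in the exported pipeline JSON. At least in the exports I have seen, that stage stores its output schema as an Avro-style record string in the plugin's schema property, so the missing fields can be added there as nullable types:

    "plugin": {
      "name": "Wrangler",
      "type": "transform",
      "properties": {
        "directives": "...",
        "schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"id\",\"type\":[\"long\",\"null\"]},{\"name\":\"notes\",\"type\":[\"string\",\"null\"]}]}"
      }
    }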

In any case, I find it strange that Wrangler behaves like this when the input schema already defines the type of each field. I hope to find the answer as I learn more about the tool.

ars_nei