
How do we convert a string to a date in Cloud Data Fusion?

I have a column with a value such as 20191120 (format yyyyMMdd) and I want to load it into a BigQuery table as a date. The table column's datatype is also date.

What I have tried so far: I converted the string to a timestamp using "parse-as-simple-date", and then tried to convert it to the format yyyy-MM-dd using format-date, but that step converts it back to a string and the final load fails. I have even tried to explicitly mark the column as date in the output schema, but it fails at runtime.
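
For reference, a rough sketch of the recipe described above in Wrangler directive form (dt_1 is only an illustrative column name, taken from the error mentioned below):

    parse-as-simple-date dt_1 yyyyMMdd
    format-date dt_1 yyyy-MM-dd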

I also tried keeping it as a timestamp in the pipeline and loading that into the BigQuery date type.

I noticed in the error that came up that the field dt_1 was incompatible with Avro integer. Is Data Fusion internally converting the extract into Avro before loading? Is Avro's lack of a date datatype what is causing the issue?

Trishit Ghosh
  • Trishit: date, timestamp and timestamp (millis) are logical types in Avro. Here is a design doc (https://wiki.cask.co/display/CE/Date+and+Time+support+in+Schema) that talks about how support for date and timestamp was added to CDAP (OSS Data Fusion). Are you sure the error you are seeing when the pipeline runs is because of date/timestamp? Can you run it as a string and see if that passes (to verify whether the date/timestamp is the issue)? – Ajai Dec 09 '19 at 05:41
  • Yes, the issue is with the date datatype only. Please observe my series of steps; maybe I am missing something. I have a string column of the format yyyyMMdd (e.g. 20191120). I convert it to a timestamp using parse-as-simple-date; the output datatype is now timestamp. I then convert it to the format yyyy-MM-dd using format-date, but this changes the datatype of the column to string. The issue now is that the datatype of my BigQuery table is date, so the BQ load fails at load time. Even when I keep the datatype as timestamp, the load fails. – Trishit Ghosh Dec 09 '19 at 10:46
  • Trishit: You can parse as simple date using a format. I am not sure if you are first parsing it as a simple date and then applying a format after that. But from what I can understand, you can apply parse-as-simple-date with the format yyyy-MM-dd and that should give you a field of type ZonedDateTime with that specific format. Either the conversion doesn't happen and we don't surface the problem in the UI, or you are applying a format change that is not necessary. Could you share a sample of the data (without any sensitive information) so that I can repro the steps? – Ajai Dec 09 '19 at 17:06
  • parse-as-simple-date works for me, but that converts the field to ZonedDateTime as you mentioned, while my target BigQuery table has a "date" datatype. The load fails with the error "java.io.IOException: Error occurred while importing data to BigQuery 'Field has incompatible types. Configured schema: date; Avro file: integer.'. There are total 1 error(s) for BigQuery job . Please look at BigQuery job logs for more information" – Trishit Ghosh Dec 10 '19 at 13:24
  • For a sample, let's say my CSV file has only one row with the field 20191130. The value is a date in the format yyyyMMdd. I want to load it into a BigQuery table with only one column, dt_col, of datatype date. If I change the datatype of dt_col in BigQuery to timestamp the load works fine, but the requirement is to load it into a date datatype. – Trishit Ghosh Dec 10 '19 at 13:27
  • Ah, ok Trishit: Can you try this: 1. Go to the LocalDateTime column in Wrangler; 2. Open the dropdown and click on "Custom Transform"; 3. Type `timestamp.toLocalDate()` (timestamp being the column name). After the last step it should convert it into the LocalDate type, which you can write to BigQuery. Hope this helps. – Ajai Dec 10 '19 at 20:34
  • I will add this^ as an answer if you can confirm that it works for you. – Ajai Dec 10 '19 at 20:36
  • It gives me an error at runtime: "java.lang.Exception: Stage:Wrangler - Reached error threshold 1, terminating processing due to error : Error encountered while executing 'set-column' : Error encountered while executing 'ts_col.toLocalDate()', at line '1' and column '1'. Make sure a valid jexl transformation is provided". When I apply the transform in the Wrangler it does not show any error and the date is also displayed correctly, but the pipeline fails at runtime. – Trishit Ghosh Dec 11 '19 at 13:41
  • Trishit: I don't think this is related to the date conversion error. I believe you are doing a set-column directive somewhere which changes the column name. Can you try using the new column name in the custom transformation and check if that works? I tried with just the transformation and I am able to write a date column to my BigQuery table. – Ajai Dec 12 '19 at 19:13
  • Yes, it seems to be some other issue in the pipeline I am trying to use; it works when I create a fresh pipeline. The use of toLocalDate() sets the column datatype to date, which solves the issue originally raised :). Please add this as an answer. The issue I currently face might not be related to the original question. – Trishit Ghosh Dec 13 '19 at 05:24

3 Answers


Adding answer for posterity:

You can try the following:

  1. Go to the LocalDateTime column in Wrangler
  2. Open the dropdown and click on "Custom Transform"
  3. Type `timestamp.toLocalDate()` (timestamp being the column name)

After the last step it should convert the column into the LocalDate type, which you can write to BigQuery. Hope this helps.
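
For reference, the Custom Transform above is applied as a set-column directive under the hood (the runtime error in the comments mentions 'set-column'), so the equivalent Wrangler recipe would look roughly like this, assuming the source column is named timestamp and arrives in the yyyyMMdd format from the question:

    parse-as-simple-date timestamp yyyyMMdd
    set-column timestamp timestamp.toLocalDate()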

Ajai

For this specific date format, the Wrangler Transform directive would be:

parse-as-simple-date date_field_dt yyyyMMdd
set-column date_field_dt date_field_dt.toLocalDate()

The second line is required if the destination is of type Date.

Skip empty values:

set-column date_field_dt empty(date_field_dt) ? date_field_dt : date_field_dt.toLocalDate()

References:

https://github.com/data-integrations/wrangler/blob/develop/wrangler-docs/directives/parse-as-simple-date.md

https://github.com/data-integrations/wrangler/blob/develop/wrangler-docs/directives/parse-as-date.md

xgMz

You could try to parse your input data with Data Fusion using Wrangler.

In order to test it out, I have replicated a workflow where a Data Fusion pipeline is fed with data coming from BigQuery. This data is then parsed to the proper type and exported back to BigQuery. Note that the public dataset is "austin_311" and I have used the '311_request' table, as some of its columns are of TIMESTAMP type.

The steps I have done are the following:

  1. I have queried a public dataset that contained TIMESTAMP data using:

    select * from `bigquery-public-data.austin_311.311_request`
    limit 1000;
    
  2. I have uploaded it to Google Cloud Storage.

  3. I have created a new Data Fusion batch pipeline following this.

  4. I have used Wrangler to parse the CSV data to a custom 'Simple Date' format yyyy-MM-dd HH:mm:ss (see the directive sketch after this list).

  5. I have exported Pipeline results to BigQuery.
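
A rough sketch of the Wrangler directive behind step 4 (created_date is only an illustrative column name; the actual TIMESTAMP columns in austin_311.311_request may be named differently):

    parse-as-simple-date created_date yyyy-MM-dd HH:mm:ss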

This qwiklab has helped me through the steps.

Result:

Following the above procedure I have been able to export Data Fusion data to BigQuery and the DATE fields are exported as TIMESTAMP, as expected.


Joaquim