2

I am trying to achieve the following in JAQL and am stuck.

I have two files: File data.tsv, which contains tab separated data, and a file header.tsv, which contains exactly one line with tab separated values, corresponding to the "header" of file data.tsv.

What I want to achieve is to read data.tsv using:

read(lines(location='data.tsv')) -> transform catch(delToJson($, {"schema": schema_json, "delimiter": "\t"}), {"errThresh":99999999999},$);

For this I need schema_json, a schema definition. I'd like to create this schema_json from file header.tsv (and assigning every field the type "string").

Reading header.tsv is straight forward, and putting it into a record of type header_record = {"header1": string, "header2":string, ....} as well. However how do I transform the jaql record header_record to an object of type schema: schema_json = schema {"header1":string,"header2":string, ....}?

Blaubaer
  • 654
  • 1
  • 5
  • 15

1 Answers1

1

OK, here is a very dirty workaround, that nevertheless does the trick. I am still waiting for IBM support to get back to me with "the canonical way" (although I doubt this exists):

First, define path of the header file

HeaderFilePath = '/data/column_headers.tsv';

Then read the header file. Output is an array.

HeaderFile = localRead(del(location=HeaderFilePath, delimiter = "\t"));

Now I construct two arrays of the same length as the HeaderFile array, in order to use them with arrayToRecord in the next step. Why I construct two and not just one will be apparent later.

val_array = HeaderFile -> expand -> transform 'some string';
val_array2 = HeaderFile -> expand -> transform 'some other string';

The idea is to build an artificial record schema_record with the same schema as the data and then to get the schema via schemaof, which then can be used as schema input for reading the data file. For this one can use

schema_record = arrayToRecord(HeaderFile -> expand,val_array)

Problems:

a) schemaof(schema_record) returns schema { * }?. This is because schemas can (seemingly) only be inferred from materialized data, i.e. one has to use schema_record := arrayToRecord(HeaderFile -> expand,val_array).

b) Now, using schemaof(schema_record) returns a schema. Which is good. However, I don't understand why a schema function would do something like this, but the schema record looks something like "header1": @{const: "some string", fixed: 11} string instead of the expected "header1": string. Hence this "schema" is pretty much useless. What is worse, there seems to be no way to manipulate that schema object, such that one might be able to remove the @{} specifications.

Workaround: use function elementsOf, which returns the schema of elements of an array of schemas. Meaning:

elementsOf([schemaof({a:1,b:3}),{a:1,b:3}]); 
>> schema {"a":@{const: 1, fixed: 1} long, "b":@{const: 3, fixed: 1} long}.

However, using schemas with different "const" and "fixed" records will force elementsOf to fall back to a "raw" schema (without @{})

elementsOf([schemaof({a:1,b:3}),{a:45,b:32}])
>> schema {"a": long, "b": long}.

This is the "dirty workaround" that I use to achieve what I want. (And all this is due to a very strange understanding of what a schema is...)

schema_array := [arrayToRecord(HeaderFile -> expand, val_array),arrayToRecord(HeaderFile -> expand, val_array2)];

DataSchema := elementsOf(schemaof(schema_array));

Data = read(lines(location='/data/data.tsv')) -> transform catch(delToJson($,
{"schema": DataSchema, "delimiter": "\t"}), {"errThresh": 99999999999},$);
Blaubaer
  • 654
  • 1
  • 5
  • 15