OK, here is a very dirty workaround, that nevertheless does the trick. I am still waiting for IBM support to get back to me with "the canonical way" (although I doubt this exists):
First, define path of the header file
HeaderFilePath = '/data/column_headers.tsv';
Then read the header file. Output is an array.
HeaderFile = localRead(del(location=HeaderFilePath, delimiter = "\t"));
Now I construct two arrays of the same length as the HeaderFile array, in order to use them with arrayToRecord
in the next step. Why I construct two and not just one will be apparent later.
val_array = HeaderFile -> expand -> transform 'some string';
val_array2 = HeaderFile -> expand -> transform 'some other string';
The idea is to build an artificial record schema_record with the same schema as the data and then to get the schema via schemaof
, which then can be used as schema input for reading the data file.
For this one can use
schema_record = arrayToRecord(HeaderFile -> expand,val_array)
Problems:
a) schemaof(schema_record)
returns schema { * }?
. This is because schemas can (seemingly) only be inferred from materialized data, i.e. one has to use schema_record := arrayToRecord(HeaderFile -> expand,val_array)
.
b) Now, using schemaof(schema_record)
returns a schema. Which is good. However, I don't understand why a schema function would do something like this, but the schema record looks something like "header1": @{const: "some string", fixed: 11} string
instead of the expected "header1": string
. Hence this "schema" is pretty much useless. What is worse, there seems to be no way to manipulate that schema object, such that one might be able to remove the @{}
specifications.
Workaround: use function elementsOf
, which returns the schema of elements of an array of schemas. Meaning:
elementsOf([schemaof({a:1,b:3}),{a:1,b:3}]);
>> schema {"a":@{const: 1, fixed: 1} long, "b":@{const: 3, fixed: 1} long}.
However, using schemas with different "const" and "fixed" records will force elementsOf
to fall back to a "raw" schema (without @{})
elementsOf([schemaof({a:1,b:3}),{a:45,b:32}])
>> schema {"a": long, "b": long}.
This is the "dirty workaround" that I use to achieve what I want. (And all this is due to a very strange understanding of what a schema is...)
schema_array := [arrayToRecord(HeaderFile -> expand, val_array),arrayToRecord(HeaderFile -> expand, val_array2)];
DataSchema := elementsOf(schemaof(schema_array));
Data = read(lines(location='/data/data.tsv')) -> transform catch(delToJson($,
{"schema": DataSchema, "delimiter": "\t"}), {"errThresh": 99999999999},$);