My time-related data is initially stored as integers in this format:
1234 # corresponds to 12:34
2359 # corresponds to 23:59
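Splitting such an integer into hours and minutes is plain integer arithmetic. A minimal sketch in pure Python, where split_hhmm is a hypothetical helper name and not part of TensorFlow:

```python
def split_hhmm(value):
    """Split an HHMM integer such as 1234 into (hours, minutes)."""
    hours, minutes = divmod(value, 100)
    # Reject values that cannot be a time of day, e.g. 2460 or 1299.
    if not (0 <= hours <= 23 and 0 <= minutes <= 59):
        raise ValueError("not a valid HHMM time: %r" % (value,))
    return hours, minutes


print(split_hhmm(1234))  # (12, 34)
print(split_hhmm(2359))  # (23, 59)
```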
1) The first option is to describe the time as a single numeric_column:
tf.feature_column.numeric_column(key="start_time", dtype=tf.int32)
2) Another option is to split the time into hours and minutes, using two separate feature columns:
tf.feature_column.numeric_column(key="start_time_hours", dtype=tf.int32)
tf.feature_column.numeric_column(key="start_time_minutes", dtype=tf.int32)
3) The third option is to keep a single feature column, but tell TensorFlow that it holds two values (hours and minutes):
tf.feature_column.numeric_column(key="start_time", shape=2, dtype=tf.int32)
Does this split make sense, and what is the difference between options 2) and 3)?
As an additional question, I ran into problems decoding vector data from a CSV file:
1|1|FGTR|1|1|14,2|15,1|329|3|10|2013
1|1|LKJG|1|1|7,2|19,2|479|7|10|2013
1|1|LKJH|1|1|14,2|22,2|500|3|10|2013
How do I let TensorFlow know that "14,2" and "15,1" should be treated as tensors with shape=2?
Edit 1:
I found a solution to decode "array"-like data from csv.
In the train and evaluate input functions I added a .map step to decode the data for some columns:
dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels)).map(parse_csv)
where parse_csv is implemented as:
def parse_csv(features, label):
    # Split a string such as "14,2" on the comma and convert the pieces to an int32 tensor.
    features['start_time'] = tf.string_to_number(tf.string_split([features['start_time']], delimiter=',').values, tf.int32)
    return features, label
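For a single row, the same decoding can be sketched in plain Python without TensorFlow, which is handy for checking the expected values. The helper names decode_vector and parse_row are hypothetical, and the positions of the vector columns (5 and 6) are an assumption read off the sample rows above:

```python
def decode_vector(field, delimiter=","):
    """Turn a string such as "14,2" into a list of ints, e.g. [14, 2]."""
    return [int(part) for part in field.split(delimiter)]


def parse_row(line, vector_columns=(5, 6)):
    """Split a pipe-delimited row and decode the assumed vector columns."""
    fields = line.split("|")
    return [
        decode_vector(field) if index in vector_columns else field
        for index, field in enumerate(fields)
    ]


row = parse_row("1|1|FGTR|1|1|14,2|15,1|329|3|10|2013")
print(row[5], row[6])  # [14, 2] [15, 1]
```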
I think the difference between two separate columns and one column with shape=2
lies in how the "weights" are distributed.
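To make that intuition concrete, here is a hedged sketch in plain Python rather than TensorFlow. It assumes a simple linear model and made-up weight values. It only illustrates that two scalar inputs and one length-2 input produce the same weighted sum when the weights match, and it says nothing about how a particular estimator creates, initializes, or regularizes those weights:

```python
def linear_two_columns(hours, minutes, w_hours, w_minutes, bias=0.0):
    """Option 2: two scalar features, each with its own weight."""
    return w_hours * hours + w_minutes * minutes + bias


def linear_one_column(start_time, weights, bias=0.0):
    """Option 3: one feature of length 2 with a weight vector of length 2."""
    return sum(w * x for w, x in zip(weights, start_time)) + bias


# Weight values below are arbitrary and chosen only for illustration.
a = linear_two_columns(12, 34, w_hours=0.5, w_minutes=0.25)
b = linear_one_column([12, 34], weights=[0.5, 0.25])
print(a == b)  # True
```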