I have come up with the following code to remove blank lines from a TensorFlow dataset before the lines are parsed as CSV input. It works fine so far. Is there a better or more effective way to do this?
def filter_blank_lines(line):
    import re
    # This function gets called once for each separate item in the dataset.
    # print("filter_blank_lines line:", line, type(line))
    line2 = line.decode()  # re won't take a byte string!
    # Search for useful data: \w matches any word character, so whitespace,
    # commas and other special characters are skipped (\w is a subset of \S).
    m = re.search(r"\w", line2, re.I | re.M)
    # print("filter_blank_lines line2:", line2, "m:", m)
    if m is None:
        return False  # search failed, whitespace only
    else:
        return True   # non-blank line

dataset = dataset.filter(
    lambda line: tf.py_func(filter_blank_lines, [line], tf.bool, stateful=False))
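For context, here is roughly where the filter sits in my pipeline. The file name and the two-float-column schema below are made up for illustration; the real point is that the filter runs before any CSV parsing:

import tensorflow as tf  # TF 1.x

dataset = tf.data.TextLineDataset("data.csv")   # "data.csv" is a placeholder name
dataset = dataset.filter(
    lambda line: tf.py_func(filter_blank_lines, [line], tf.bool, stateful=False))

def parse_line(line):
    # record_defaults assume two float columns, purely for illustration
    col1, col2 = tf.decode_csv(line, record_defaults=[[0.0], [0.0]])
    return col1, col2

dataset = dataset.map(parse_line)   # decode_csv now only sees non-blank lines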
Background info: a number of the TensorFlow demo scripts assume that the input files are squeaky clean, with no embedded blank lines or stray whitespace in the data. When spurious whitespace is introduced (deliberately or accidentally), the tf.decode_csv op appears to be the one complaining, with no clue as to which line it is unhappy with. I prefer my code to be tolerant when processing input files, hence this effort to remove blank lines.
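To make that concrete, the kind of input I mean looks something like this (made-up two-column data with a stray blank line in the middle):

1.0,2.0

3.0,4.0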
The line below doesn't work because it will remove valid lines that have leading whitespace.
dataset2 = dataset2.filter(lambda line: tf.not_equal(tf.substr(line, 0, 1), " "))
I also tinkered with the new tf.regex_replace in TensorFlow v1.7.0. It does not work here, as filter expects a boolean result, and tf.cast doesn't help either.
dataset2 = dataset2.filter(lambda line: tf.cast(tf.regex_replace(line, "^\s*|\s*$", ""), tf.bool))
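One variation along the same lines that I have not tested: instead of casting, strip all whitespace with tf.regex_replace and compare the result against the empty string, which should give filter the boolean it wants. A rough sketch of what I mean:

# Untested sketch: keep a line only if something is left after removing
# all whitespace. Uses only ops already mentioned above (TF 1.7 graph mode).
dataset2 = dataset2.filter(
    lambda line: tf.not_equal(tf.regex_replace(line, r"\s+", ""), ""))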