I have come up with the following code to remove blank lines from a TensorFlow dataset before they are parsed as CSV input. It works fine so far. Is there a better or more effective way to do this?

import re

def filter_blank_lines(line):
   # This function gets called once for each separate item in the dataset;
   # tf.py_func passes the line in as a byte string.
   # print("filter_blank_lines line:", line, type(line))
   line2 = line.decode() # a str pattern won't match a byte string!
   # Here we search for useful data, ignoring whitespace, commas and
   # other special characters (\w matches a subset of \S).
   m = re.search(r"\w", line2)
   # print("filter_blank_lines line2:", line2, "m:", m)
   return m is not None # True for a non-blank line, False for whitespace only

dataset = dataset.filter(lambda line: tf.py_func(filter_blank_lines, [line], tf.bool, stateful=False))
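If the per-line cost of the predicate matters, one possible tightening is sketched below. It is only a sketch: the names `_HAS_DATA` and `filter_blank_lines_fast` are made up for illustration, and it assumes ASCII data, because a bytes pattern matches only ASCII word characters, whereas decoding first also matches non-ASCII letters.

```python
import re

# Compiled once at import time. A bytes pattern matches the raw line
# directly, so no decode() is needed; for bytes, \w covers ASCII
# letters, digits and underscore only.
_HAS_DATA = re.compile(rb"\w")

def filter_blank_lines_fast(line):
    """Return True if the line holds at least one word character."""
    return _HAS_DATA.search(line) is not None

# Quick check outside TensorFlow on a few hypothetical input lines.
samples = [b"5.1,3.5,1.4", b"  4.9, 3.0", b"", b"   \t", b" , ,, "]
kept = [s for s in samples if filter_blank_lines_fast(s)]
```

Only the two lines that contain data survive; the empty, whitespace-only and comma-only lines are dropped.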

Background info: A number of the TensorFlow demo scripts assume that the input files are squeaky clean, with no embedded blank lines or whitespace in the data. When spurious whitespace is added (deliberately or accidentally), the tf.decode_csv method appears to be the one complaining, with no clues as to which line it is unhappy with. I prefer my code to be tolerant when processing input files. Hence this effort on removing blank lines.

The line below doesn't work because it will remove valid lines that have leading whitespace.

dataset2 = dataset2.filter(lambda line: tf.not_equal(tf.substr(line, 0, 1), " "))
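The failure mode of the first-character test can be seen with a pure-Python stand-in. This is an emulation rather than TensorFlow itself: `first_char_is_not_space` is a hypothetical helper that mimics the comparison for non-empty lines only.

```python
def first_char_is_not_space(line: bytes) -> bool:
    # Stand-in for tf.not_equal(tf.substr(line, 0, 1), " ")
    # applied to a single non-empty line.
    return line[0:1] != b" "

# Hypothetical input lines: two valid rows and two blank-looking rows.
samples = [b"5.1,3.5", b"  4.9,3.0", b" ", b"\t"]
kept = [s for s in samples if first_char_is_not_space(s)]
dropped = [s for s in samples if not first_char_is_not_space(s)]
```

The valid row with leading spaces ends up in `dropped`, while the tab-only line ends up in `kept`, so the test is wrong in both directions.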

I tinkered with the new tf.regex_replace in TensorFlow v1.7.0. It does not work here on its own, because filter expects a boolean result and tf.regex_replace returns a string. tf.cast doesn't help either.

dataset2 = dataset2.filter(lambda line: tf.cast(tf.regex_replace(line, "^\s*|\s*$", ""), tf.bool))
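One way to get a boolean out of the stripped string is to compare it with the empty string instead of casting it. The runnable part below emulates that idea in plain Python with the same pattern; the TensorFlow line in the trailing comment is an untested sketch that assumes tf.not_equal accepts string tensors. Unlike the `\w` search, this check keeps lines that contain only commas.

```python
import re

# Same pattern as in the tf.regex_replace attempt, as a bytes pattern.
_EDGE_WHITESPACE = re.compile(rb"^\s*|\s*$")

def is_non_blank(line: bytes) -> bool:
    # Strip leading and trailing whitespace, then test what is left.
    stripped = _EDGE_WHITESPACE.sub(b"", line)
    return stripped != b""

# Hypothetical input lines.
samples = [b"5.1,3.5", b"  4.9, 3.0  ", b"", b"  \t ", b" , , "]
kept = [s for s in samples if is_non_blank(s)]

# Untested TensorFlow analogue (assumption, not verified against v1.7.0):
# dataset2 = dataset2.filter(
#     lambda line: tf.not_equal(tf.regex_replace(line, r"^\s*|\s*$", ""), ""))
```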
mikkola
John Brearley
  • I don't think you need tensorflow or `re` to do this. Is your input just a csv file? – user3483203 Apr 04 '18 at 14:55
  • Could you run the regex not as an argument to `filter`, but just `map` each dataset element with it? – mikkola Apr 04 '18 at 15:18
  • You could probably combine the techniques of using `strip()` and `if row:` from these questions: https://stackoverflow.com/questions/10794245/removing-spaces-and-empty-lines-from-a-file-using-python, https://stackoverflow.com/questions/4521426/delete-blank-rows-from-csv – Andrew Zick Apr 04 '18 at 15:29
  • py_func is probably the best option at this point. If you use prefetch after py_func, your python code can run in parallel with your training during `session.run()`. You can make your filter_blank_lines more efficient if that matters. You can also file a feature request for `tf.regex_match`. – iga Apr 05 '18 at 00:57
  • Thanks for the pointers to the CSV module. I had found that pandas.read_csv(fn, skipinitialspace=True) works very well: it happily ignores blank lines and strips spaces from individual fields. You just have to specify skipinitialspace=True, which defaults to False. The numpy.loadtxt routine needs a lot of extra code for converters to remove spaces, and has no option to filter blank lines. – John Brearley Apr 05 '18 at 15:05
  • I requested a new feature, tf.regex_match, see https://github.com/tensorflow/tensorflow/issues/18264#issuecomment-379171582 – John Brearley Apr 06 '18 at 13:03
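The `strip()` and CSV-module suggestions from the comments can be combined without TensorFlow or pandas. The following is a minimal sketch using only the standard library; `read_clean_rows` and the sample data are hypothetical, and whether this fits depends on the file being small enough to preprocess outside the tf.data pipeline.

```python
import csv
import io

def read_clean_rows(text_stream):
    """Parse CSV rows, skipping blank lines and trimming each field."""
    # Drop lines that are empty or contain only whitespace.
    non_blank = (line for line in text_stream if line.strip())
    # skipinitialspace ignores spaces that directly follow a delimiter.
    reader = csv.reader(non_blank, skipinitialspace=True)
    # strip() also removes any whitespace left around a field.
    return [[field.strip() for field in row] for row in reader]

# Hypothetical input with an empty line, a whitespace-only line and
# stray spaces around fields.
sample = io.StringIO("5.1, 3.5, 1.4\n\n   \n 4.9,3.0 , 1.4\n")
rows = read_clean_rows(sample)
```

Here `rows` holds two cleaned records, which could then be written back out or fed to the dataset.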
