The BOM problem in Java is covered in "Reading UTF-8 - BOM marker". Java's UTF-8 decoder leaves the BOM in the decoded text, so you can either abstract it away with BOMInputStream from Apache Commons IO or remove it manually, e.g.:
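    ;; Strip a leading BOM (U+FEFF) from a string, if present.
    (defn debomify
      [^String line]
      (let [bom "\uFEFF"]
        (if (.startsWith line bom)
          (.substring line 1)
          line)))

    ;; Eager variant: read the whole file, drop the BOM, then split into lines.
    (doall (map #(.length %) (.split (debomify (slurp "test.txt")) "\n")))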
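To see the effect without an existing test.txt, here is a minimal, self-contained sketch. It assumes a scratch file created with File/createTempFile; the names bom-file, raw and line-lengths are made up for this illustration, and debomify is repeated only so the snippet runs on its own.

```clojure
(require '[clojure.java.io :as io])

;; Same helper as above, repeated so this snippet is self-contained.
(defn debomify
  [^String line]
  (let [bom "\uFEFF"]
    (if (.startsWith line bom)
      (.substring line 1)
      line)))

;; Hypothetical scratch file: the three UTF-8 BOM bytes (EF BB BF)
;; followed by two short lines.
(def bom-file (java.io.File/createTempFile "bom-demo" ".txt"))

(with-open [^java.io.OutputStream out (io/output-stream bom-file)]
  (.write out (byte-array [(unchecked-byte 0xEF)
                           (unchecked-byte 0xBB)
                           (unchecked-byte 0xBF)]))
  (.write out (.getBytes "ab\ncd" "UTF-8")))

;; slurp decodes the BOM bytes into a leading U+FEFF character.
(def raw (slurp bom-file :encoding "UTF-8"))

;; After debomify, only the real content is split and measured.
(def line-lengths
  (mapv #(.length ^String %) (.split ^String (debomify raw) "\n")))
```

Without debomify the first line would be one character longer than it looks, which is the kind of off-by-one that is hard to spot because U+FEFF is invisible when printed.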
If you want to read a file lazily using line-seq, for instance because it's huge, you have to treat the first line using debomify. The remaining ones can be read normally. Hence:
    (defn debommed-line-seq
      [^java.io.BufferedReader rdr]
      (when-let [line (.readLine rdr)]
        (cons (debomify line) (lazy-seq (line-seq rdr)))))
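A possible way to use it is sketched below. Because the sequence is lazy, it has to be consumed while the reader is still open, hence the doall inside with-open. The scratch file and the names bom-file and lines are assumptions made for this example, and both functions are repeated so the snippet runs on its own.

```clojure
(require '[clojure.java.io :as io])

;; Definitions repeated from above so this snippet is self-contained.
(defn debomify
  [^String line]
  (let [bom "\uFEFF"]
    (if (.startsWith line bom)
      (.substring line 1)
      line)))

(defn debommed-line-seq
  [^java.io.BufferedReader rdr]
  (when-let [line (.readLine rdr)]
    (cons (debomify line) (lazy-seq (line-seq rdr)))))

;; Hypothetical scratch file; writing U+FEFF as UTF-8 produces the BOM bytes.
(def bom-file (java.io.File/createTempFile "bom-lines" ".txt"))
(spit bom-file "\uFEFFfirst\nsecond\nthird\n" :encoding "UTF-8")

;; Realize the lazy seq before with-open closes the reader.
(def lines
  (with-open [rdr (io/reader bom-file :encoding "UTF-8")]
    (doall (debommed-line-seq rdr))))
```

For a genuinely huge file you would process the lines inside with-open (for example with doseq or reduce) rather than holding them all with doall. For an empty file debommed-line-seq returns nil.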