Length of the first line in a UTF-8 file with BOM

Question

Good afternoon. Suppose I have a UTF-8 file containing a single letter, say "f" (no \n and no spaces), and I try to get a sequence of line lengths.

(require '[clojure.java.io :refer [reader]])

(with-open [rdr (reader "test.txt")]
  (doall (map #(.length %) (line-seq rdr))))

And I get

=> (2)

Why? Is there any elegant way to get the right length of the first string?

I cannot reproduce it. I used your code with UTF-8 file containing one- or two-byte characters, both with or without `\n` at the end. In all cases I got `(1)`. What's your Clojure version? — Jan, Dec 09 '12 at 16:15
Just a random thought, what if you put a BOM in your test files ? — SirDarius, Dec 09 '12 at 16:17
My Clojure version is 1.4. Yes, in reality that is BOM. How could I bypass the problem? — Oleg Leonov, Dec 09 '12 at 16:39

score 8 · Accepted Answer · edited May 23 '17 at 12:25

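The BOM problem in Java is covered in "Reading UTF-8 - BOM marker". Java's UTF-8 decoder does not strip the byte order mark, so it appears as the character U+FEFF at the start of the first line, which is why the length comes out as 2. The BOM can either be abstracted away using BOMInputStream from Apache Commons IO or be removed manually, for example: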

(defn debomify
  [^String line]
  (let [bom "\uFEFF"]
    (if (.startsWith line bom)
      (.substring line 1)
      line)))

(doall (map #(.length %) (.split (debomify (slurp "test.txt")) "\n")))
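To see where the extra character comes from without touching a file, here is a minimal sketch that decodes the raw bytes a BOM-prefixed "f" would have on disk. The names `bom-bytes` and `decoded` are made up for this illustration.

```clojure
;; Bytes of a UTF-8 file holding a BOM (EF BB BF) followed by "f".
;; unchecked-byte is needed because Java bytes are signed.
(def bom-bytes (byte-array (map unchecked-byte [0xEF 0xBB 0xBF 0x66])))

;; Decoding as UTF-8 keeps the BOM as the single character U+FEFF.
(def decoded (String. ^bytes bom-bytes "UTF-8"))

(.length decoded)            ; 2
(= \uFEFF (first decoded))   ; true
```

So the three BOM bytes become one char in the decoded string, and that char is exactly what `debomify` checks for.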

If you want to read the file lazily with line-seq, for instance because it's huge, only the first line needs to go through debomify; the remaining lines can be read normally. Hence:

(defn debommed-line-seq
  [^java.io.BufferedReader rdr]
  (when-let [line (.readLine rdr)]
    (cons (debomify line) (lazy-seq (line-seq rdr)))))
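A quick way to try this out is with an in-memory reader instead of a file. The sketch below repeats the two functions so it runs on its own; the `sample` string is a hypothetical stand-in for the contents of a BOM-prefixed file.

```clojure
(import '(java.io BufferedReader StringReader))

;; Repeated from the answer so the snippet is self-contained.
(defn debomify [^String line]
  (let [bom "\uFEFF"]
    (if (.startsWith line bom) (.substring line 1) line)))

(defn debommed-line-seq [^BufferedReader rdr]
  (when-let [line (.readLine rdr)]
    (cons (debomify line) (lazy-seq (line-seq rdr)))))

;; Hypothetical input: a BOM, then two lines.
(def sample "\uFEFFf\nsecond line")

;; Plain line-seq counts the BOM as part of the first line.
(with-open [rdr (BufferedReader. (StringReader. sample))]
  (doall (map count (line-seq rdr))))           ; (2 11)

;; debommed-line-seq strips it from the first line only.
(with-open [rdr (BufferedReader. (StringReader. sample))]
  (doall (map count (debommed-line-seq rdr))))  ; (1 11)
```

As with `line-seq`, the sequence has to be realized (here with `doall`) before `with-open` closes the reader.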

answered Dec 09 '12 at 16:45

Jan

Thank you. Perhaps this is a solution. – Oleg Leonov Dec 09 '12 at 17:01
Thanks for more detailed version. – Oleg Leonov Dec 09 '12 at 17:25
Maybe the more optimal method is to do simply (debomify (slurp "test.txt")) and then split it. – Oleg Leonov Dec 09 '12 at 18:45
@ОлегЛеонов, thanks, you're absolutely right. I've fixed the answer. – Jan Dec 09 '12 at 19:07
@MichielBorkent, ...a lazy approach would be welcome. Thanks for pointing it out. – Jan Dec 09 '12 at 20:18
@MichielBorkent, that would indeed work, however there's no need to call `debomify` on each line of the read file. `debommed-line-seq` calls `debomify` only on the first one. – Jan Dec 09 '12 at 20:52
I'm afraid we've lost laziness: (is (instance? clojure.lang.LazySeq (debommed-line-seq (reader "test.txt")))) => false – Oleg Leonov Dec 10 '12 at 08:46
@OlegLeonov, the first line is evaluated strictly and subsequent ones in a lazy seq. If you'd like the first one to be lazy as well wrap the call to `cons` in `lazy-seq`. – Jan Dec 10 '12 at 08:48
I appreciate your help. It seems to me that lazy-seq in expression (lazy-seq (line-seq rdr)) is unnecessary. I'd like to suggest final variant: (defn debommed-line-seq [^java.io.BufferedReader rdr] (when-let [line (.readLine rdr)] (lazy-seq (cons (debomify line) (line-seq rdr))))) – Oleg Leonov Dec 10 '12 at 09:03
@OlegLeonov, in such case you'll get a lazy seq, whose tail will be a normal, strict cons, whose tail will be a lazy seq. This time the second element is non-lazy :). Also, take a look at the source of Clojure's [`line-seq`](http://www.clodoc.org/doc/clojure.core/line-seq). That's what I've based `debommed-line-seq` on. `line-seq` reads the first line strictly and remaining ones are read lazily. – Jan Dec 10 '12 at 09:09
