I am using Python's codecs module to write some UTF-8 text to a file:

#-*-coding:utf-8-*-
import codecs

filename = 'afile'
with codecs.open(filename, encoding='utf-8', mode='w') as fw :
    fw.write('<DOC>\n<DOCNO>')
    fw.write(filename)
    fw.write('</DOCNO>\n<TEXT>\n')        
    fw.write('কাজ'.decode('utf-8'))
    fw.write('\n</TEXT>\n</DOC>')

Now if I run Lemur (http://www.lemurproject.org/) on the directory with this file, Lemur tells me the document is 'malformed'.

0:00: Opened /home/userA/Documents/test_corpus/afile 
0:00: Error in /home/userA/Documents/test_corpus/afile : ../src/TaggedDocumentIterator.cpp(213): Malformed document: /home/userA/Documents/test_corpus/afile

However, if I open the file in gedit, add a character and delete it again (so that the visible content stays the same), save the file, and then run Lemur, it runs without errors.

0:00: Opened /home/userA/Documents/test_corpus/afile
0:00: Documents parsed: 1 Documents indexed: 1
0:00: Closed /home/userA/Documents/test_corpus/afile

So is there a difference between the way Python and gedit save a text file that makes Lemur respond differently in the two cases?

Avisek
  • So what codec did GEdit use to save the file? – Martijn Pieters Apr 07 '15 at 11:01
  • It used the default UTF-8 encoding – Avisek Apr 07 '15 at 11:02
  • Can you save the GEdit format to a new file and verify what the differences are? Perhaps there is a trailing newline added? – Martijn Pieters Apr 07 '15 at 11:05
  • Yes, thank you, that is what was happening. gedit inserts a 'line feed' Unicode : U+000A (http://unicode-table.com/en/search/?q=10) at the end of the file. Lemur likes that. – Avisek Apr 07 '15 at 11:22
  • That's a newline; add `\n` to the end, after the `</DOC>` string. – Martijn Pieters Apr 07 '15 at 11:23
  • Because the TREC format you're using for documents actually supports multiple documents in a single file, it requires a newline between each. It's a less-flexible format, but meant to go very fast over very large collections. This is one of the reasons the ```` tag appears on its own line. – John Foley Apr 07 '15 at 15:10
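
One way to carry out the comparison suggested in the comments is to look at the raw bytes of both copies. The following is a minimal sketch, not something from the original thread: the helper names `tail_bytes` and `differs_only_by_final_newline` are hypothetical, and the two temporary files are stand-ins for the copy written by Python and the copy saved by gedit.

```python
import os
import tempfile


def tail_bytes(path, count=8):
    """Return the last `count` bytes of the file at `path`."""
    with open(path, 'rb') as handle:
        return handle.read()[-count:]


def differs_only_by_final_newline(path_a, path_b):
    """Return True if file B is exactly file A plus one trailing line feed."""
    with open(path_a, 'rb') as handle_a, open(path_b, 'rb') as handle_b:
        return handle_b.read() == handle_a.read() + b'\n'


# Stand-in files for the demonstration (assumed content); use the real paths instead.
workdir = tempfile.mkdtemp()
python_copy = os.path.join(workdir, 'afile_python')
gedit_copy = os.path.join(workdir, 'afile_gedit')
with open(python_copy, 'wb') as handle:
    handle.write(b'<TEXT>\n</TEXT>\n</DOC>')
with open(gedit_copy, 'wb') as handle:
    handle.write(b'<TEXT>\n</TEXT>\n</DOC>\n')

python_tail = tail_bytes(python_copy)
gedit_tail = tail_bytes(gedit_copy)
only_newline_differs = differs_only_by_final_newline(python_copy, gedit_copy)
```

Reading in binary mode avoids any newline translation or decoding, so a trailing U+000A shows up as a literal `\n` byte at the end of one file and not the other.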

1 Answer

You are writing an output file without a newline on the last line:

fw.write('\n</TEXT>\n</DOC>')

GEdit probably adds that extra newline when saving. Add an extra \n:

fw.write('\n</TEXT>\n</DOC>\n')
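
For completeness, here is a hedged sketch of the whole write with the trailing newline in place, followed by a byte-level check. It is an illustration under assumptions: the function name `write_trec_document` is hypothetical, it swaps `codecs.open` for `io.open` (which behaves the same on Python 2.7 and Python 3), and it writes to a temporary directory. Whether Lemur then accepts the file rests on the behaviour reported in the question, not on anything this code verifies.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_trec_document(path, docno, text):
    """Write one <DOC> record and end the file with a newline."""
    # newline='\n' keeps line feeds as-is on every platform.
    with io.open(path, mode='w', encoding='utf-8', newline='\n') as fw:
        fw.write(u'<DOC>\n<DOCNO>')
        fw.write(docno)
        fw.write(u'</DOCNO>\n<TEXT>\n')
        fw.write(text)
        fw.write(u'\n</TEXT>\n</DOC>\n')  # note the final \n


# Demonstration in a temporary directory (assumed location).
demo_path = os.path.join(tempfile.mkdtemp(), 'afile')
write_trec_document(demo_path, u'afile', u'কাজ')

# Read the raw bytes back to confirm the last byte is a line feed.
with open(demo_path, 'rb') as handle:
    written = handle.read()
ends_with_newline = written.endswith(b'</DOC>\n')
```

Passing unicode strings directly avoids the `.decode('utf-8')` step from the question, which only exists on Python 2 byte strings.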
Martijn Pieters