I'm a Python newbie. I've been searching for days, but have found only bits and pieces of what I need. I'm using Python 2.7 on Windows (I chose Python because it's cross-platform and the result can be portable on Windows).
I'd like to write a script that searches a folder for *.txt files encoded in UTF-8, loads their content one file at a time, converts non-ASCII characters to HTML entities, and then adds HTML tags at the start and end of each line. There are two variations of the tags: one for the head of the file and one for the tail, where head and tail are separated by an empty line. Finally, the result has to be written out to other text files, such as *.htm. To illustrate:
unicode1.txt:
űnícődé text line1
űnícődé text line2
[empty line]
űnícődé text line3
űnícődé text line4
The result in unicode1.htm should be:
<p class='aaa'>űnícődé text line1</p>
<p class='aaa'>űnícődé text line2</p>
[empty line]
<p class='bbb'>űnícődé text line3</p>
<p class='bbb'>űnícődé text line4</p>
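To make the goal concrete, here is a minimal sketch of the per-line transformation described above. The names `to_entities` and `wrap_lines` are made up for illustration, and the manual escaping is an assumption that stands in for `cgi.escape` so the snippet runs on both Python 2.7 and Python 3. It also assumes the lines have already been decoded to unicode.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def to_entities(text):
    """Escape HTML special characters, then turn non-ASCII into numeric entities."""
    # Hypothetical helper; "&" must be replaced first so it is not escaped twice.
    for old, new in (("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"), ('"', "&quot;")):
        text = text.replace(old, new)
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def wrap_lines(lines, head_class="aaa", tail_class="bbb"):
    """Wrap every non-empty line in <p>; switch class after the first empty line."""
    css_class = head_class
    result = []
    for line in lines:
        line = line.rstrip("\r\n")      # drop LF or CRLF before adding tags
        if not line.strip():
            css_class = tail_class      # the empty line separates head from tail
            result.append("")
            continue
        result.append("<p class='%s'>%s</p>" % (css_class, to_entities(line)))
    return result
```

Calling `wrap_lines` on the decoded lines of unicode1.txt would give the head lines class `aaa`, keep the empty line, and give the remaining lines class `bbb`.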
I started developing the core of my solution, but I got stuck. Here are the script versions (for simplicity I chose encode with xmlcharrefreplace).
V1:
import re, cgi, fileinput
file="_utf8.txt"
text=""
for line in fileinput.input(file, inplace=0):
    line=cgi.escape(line.decode('utf8'),1).encode('ascii', 'xmlcharrefreplace')
    line=re.sub(r"^", "<p>", line, 1)
    text=text+re.sub(r"$", "</p>", line, 1)
print text
This worked and gave a good result, but I don't think fileinput is the right tool for this task.
V2:
import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
    line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
    line=re.sub(r"^", "<p>", line, 1)
    text=text+re.sub(r"$", "</p>", line, 1)
f.close()
print text
This messed up the result: the closing tag appeared at the start of the line, replacing the first letter, and so on.
V3 (tried multiline flag):
import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
    line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
    line=re.sub(r"^", "<p>", line, 1, flags=re.M)
    text=text+re.sub(r"$", "</p>", line, 1, flags=re.M)
f.close()
print text
Same result.
V4 (tried 1 regex instead of 2):
import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
    line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
    text=text+re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)
f.close()
print text
Same result. Please help.
Edit: I just checked the result file with a hex editor, and there is a 0x0D byte before each closing tag! Why?
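A likely explanation, as far as I can tell: `codecs.open` always opens the file in binary mode, so on Windows every line still ends in `\r\n`, whereas `fileinput` reads in text mode and only sees `\n`. The small reproduction below needs no file at all; the sample string is an assumption standing in for one line read from a CRLF file.

```python
import re

# One line as it would arrive from a CRLF file read without newline translation.
line = "text line1\r\n"

# Without re.M, "$" matches just before the trailing "\n", that is after the "\r".
closed = re.sub(r"$", "</p>", line, count=1)

# "." matches "\r" as well, so the CR is captured inside the paragraph.
wrapped = re.sub(r"^(.*)$", r"<p>\1</p>", line, count=1)

# Removing the line ending before adding the tags avoids both effects.
clean = "<p>%s</p>\n" % line.rstrip("\r\n")

print(repr(closed))
print(repr(wrapped))
print(repr(clean))
```

In both regex variants the CR sits directly before `</p>`, which would also explain why a console shows the closing tag at the start of the line: the CR moves the cursor back to column 0 before the tag is printed.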
Edit2: changes for a more logical approach
text+=re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)
Edit3: with a hex editor I saw the reason for the messed-up result: an extra CR (0x0D) byte before each CRLF. I tracked the CR problem down to what seems to cause it, the concatenation with +:
# -*- coding: utf-8 -*-
text=""
f=u"unicode text line1\r\n unicode text line2"
for line in f:
    text+=line
print text
This results in:
unicode text line1\r\r\n unicode text line2
Any idea how to fix this?
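The `\r\r\n` may not come from the concatenation itself: on Windows, output written in text mode (including `print` to the console or to a redirected file) turns every `\n` into `\r\n`, so a string that already contains `\r\n` ends up as `\r\r\n`. A minimal sketch of one way around it, assuming `io.open` is acceptable in place of `codecs.open`; the function name `convert_folder` is hypothetical and the plain `<p>` wrapping is only a placeholder for the real tag logic.

```python
import glob
import io
import os


def convert_folder(folder):
    """Write a .htm file next to every .txt in folder; return the written paths."""
    written = []
    for src in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        dst = os.path.splitext(src)[0] + ".htm"
        # Text mode with the default newline=None turns "\r\n" into "\n" on read.
        with io.open(src, encoding="utf-8") as f:
            lines = f.read().splitlines()
        # Placeholder wrapping only; escaping and head/tail classes are left out.
        body = u"".join((u"<p>%s</p>\n" % l) if l else u"\n" for l in lines)
        # newline="\r\n" writes each "\n" as exactly one CRLF on any platform.
        with io.open(dst, "w", encoding="utf-8", newline="\r\n") as f:
            f.write(body)
        written.append(dst)
    return written
```

Because the line endings are normalized once on read and written once on output, no stray CR should remain before the closing tags.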
[...] űnícődé text line4 [...] – snim2 Jan 22 '12 at 14:49
[...] `'<p>%s</p>' % line`. If you don't want to save empty lines, just test for `if not line.strip(): continue` – reclosedev Jan 22 '12 at 15:01
[...] `re.sub(r"^", "<p>", line, 1) text=text+re.sub(r"$", "</p>", line, 1)`? Can't you just do concatenation: `text += "\n" + line + "</p>"` – Roman Susi Jan 22 '12 at 15:02
[...] `'<p>%s</p>' % line` but the result is almost the same, there are x0Dx0Dx0A bytes before closing tag (CR+CR+LF). – Jan 22 '12 at 15:27
[...] – jfs Jan 22 '12 at 15:29
[...] `'<p>%s</p>' % line`. I think %s gets the whole line from somewhere, but how, and then? – Jan 22 '12 at 15:37