I'm a Python newbie. I've been searching for days but have found only small pieces of what I need. I'm using Python 2.7 on Windows (I chose Python because it's multiplatform and the result can be portable on Windows).

I'd like to write a script that searches a folder for *.txt UTF-8 text files, loads their content one file after another, changes non-ASCII characters to HTML entities, and then adds HTML tags at the start and end of each line. There are two variations of the tags, one for the head of the file and one for the tail; head and tail are separated by an empty line. After that, the result has to be written out to other text file(s), like *.htm. To illustrate:

unicode1.txt:

űnícődé text line1
űnícődé text line2
[empty line]
űnícődé text line3
űnícődé text line4

The result in unicode1.htm should be:

<p class='aaa'>&#369;n&iacute;c&#337;d&eacute; text line1</p>
<p class='aaa'>&#369;n&iacute;c&#337;d&eacute; text line2</p>
[empty line]
<p class='bbb'>&#369;n&iacute;c&#337;d&eacute; text line3</p>
<p class='bbb'>&#369;n&iacute;c&#337;d&eacute; text line4</p>

I started developing the core of my solution, but I got stuck. See the script versions below (for simplicity I chose to encode with xmlcharrefreplace).

V1:

import re, cgi, fileinput
file="_utf8.txt"
text=""
for line in fileinput.input(file, inplace=0):
  line=cgi.escape(line.decode('utf8'),1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "<p>", line, 1)
  text=text+re.sub(r"$", "</p>", line, 1)
print text

It worked and gave a good result, but I don't think fileinput is a usable approach for this task.

V2:

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "<p>", line, 1)
  text=text+re.sub(r"$", "</p>", line, 1)
f.close()
print text

It messed up the result: the closing tag appears at the start of the line, replacing the first letter, etc.

V3 (tried multiline flag):

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "<p>", line, 1, flags=re.M)
  text=text+re.sub(r"$", "</p>", line, 1, flags=re.M)
f.close()
print text

Same result.

V4 (tried 1 regex instead of 2):

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  text=text+re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)
f.close()
print text

Same result. Please help.

Edit: I just checked the result file with a hexeditor, and there is an x0D byte before each closing tag! Why?

Edit2: changes for a more logical approach

text+=re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)

Edit3: with a hex editor I saw the reason for the messed-up result: an extra CR (x0D) byte before each CRLF. I tracked the CR problem down to what seems to cause it, the concatenation with +:

# -*- coding: utf-8 -*-
text=""
f=u"unicode text line1\r\n unicode text line2"
for line in f:
  text+=line
print text

This results in:

unicode text line1\r\r\n unicode text line2

Any idea how to fix this?

  • Indenting 4 spaces creates a code block. Edit your question so that it is more readable. – sgallen Jan 22 '12 at 14:29
  • I used indenting for the first time, but missed the empty paragraph before each indented block. –  Jan 22 '12 at 14:35
  • I'm not sure I really understand your question, I've tried your last script and it seems to get the result you are looking for and the result looks OK in the browser. Can you show the results of your testing with notes where the result is wrong? – snim2 Jan 22 '12 at 14:42
  • @snim2 For me it messed up the result: the closing tag is at the line start, deleting the first letter, and there is nothing at the line end. Here is the result if the source line is 'text': `</p>ext` –  Jan 22 '12 at 14:48
  • This is the last line I got with your test data: "<p>&#369;n&iacute;c&#337;d&eacute; text line4</p>" – snim2 Jan 22 '12 at 14:49
  • @snim2 Interesting... I got this with python2.7-windows: "</p>#369;n&iacute;c&#337;d&eacute; text line4" –  Jan 22 '12 at 14:51
  • I'm using Python 2.7.2+ on Ubuntu 11.11. I'm surprised that there's a difference between platforms, maybe worth checking http://bugs.python.org/ – snim2 Jan 22 '12 at 14:59
  • A regexp for just adding some `<p>` to the beginning and end of a line is overkill; you can just save each line as `'<p>%s</p>' % line`. If you don't want to save empty lines, just test for `if not line.strip(): continue` – reclosedev Jan 22 '12 at 15:01
  • Why do you need line=re.sub(r"^", "<p>", line, 1) text=text+re.sub(r"$", "</p>", line, 1)? Can't you just do concatenation: text += "\n<p>" + line + "</p>" – Roman Susi Jan 22 '12 at 15:02
  • @snim2 Ah, I just checked the result file with a hexeditor, and there is an x0D byte _before_ each closing tag! Why? –  Jan 22 '12 at 15:09
  • @Tib I can't see one with `cat -A`? – snim2 Jan 22 '12 at 15:12
  • @Tib it's Carriage return. http://en.wikipedia.org/wiki/Newline – reclosedev Jan 22 '12 at 15:16
  • @reclosedev I know, I have the whole ascii code set in my brain (I was asm developer long time ago) so that's why I'm wondering. –  Jan 22 '12 at 15:20
  • @snim2 Sorry, on windows there is no 'cat' command (I have only a grep.exe and sed.exe) –  Jan 22 '12 at 15:22
  • @Tib does `type` do the same thing? http://stackoverflow.com/questions/60244/is-there-replacement-for-cat-on-windows – snim2 Jan 22 '12 at 15:24
  • @reclosedev I tried `'<p>%s</p>' % line` but the result is almost the same, there are x0Dx0Dx0A bytes before the closing tag (CR+CR+LF). –  Jan 22 '12 at 15:27
  • @Tib: `'\r\n'` is a newline on Windows. You could strip it from the `line` before adding `<p>`, `</p>` – jfs Jan 22 '12 at 15:29
  • @snim2 `type` shows what I wrote first. `</p>#369;n&iacute;c&#337;d&eacute; text line4` –  Jan 22 '12 at 15:29
  • @J.F.Sebastian But I need newlines, I don't want to strip it. This extra x0D what I don't need. –  Jan 22 '12 at 15:31
  • @Tib: bytes `\x0D\x0A` are the newline on Windows. Why do you need a newline *before* `</p>`? – jfs Jan 22 '12 at 15:37
  • @reclosedev Anyway, I'm not familiar with Python (as a newbie) so I don't understand what this code does: `'<p>%s</p>' % line`. I think %s gets the whole line from somewhere, but how, and then what? –  Jan 22 '12 at 15:37
  • @J.F.Sebastian I know what is a newline. OK, I need not before, but after the tag. But before the tag, there is no full CRLF newline, only a CR. –  Jan 22 '12 at 15:39
  • `%s` is used for string formatting in Python. More info in the [docs](http://docs.python.org/library/stdtypes.html#string-formatting). But in Python 3.0+ it's deprecated and replaced with [.format](http://docs.python.org/library/string.html#formatstrings), like in Rob Wouters' answer. – reclosedev Jan 22 '12 at 15:49
  • @reclosedev thank you, it resembles a little bit (for me) Bash printf, is it? –  Jan 22 '12 at 16:12
  • @reclosedev: "deprecated" is a strong word. `.format()` is preferred to the `%` in Python3, but `%` doesn't go anywhere. – jfs Jan 22 '12 at 16:22

2 Answers

There's no need for regular expressions at all here, just do this:

with open('utf8.txt') as f:
    class_name = 'aaa'
    for line in f:
        if line == '\n':
            # an empty line separates head from tail: switch the class
            class_name = 'bbb'
        else:
            # decode / convert line
            line = '<p class="{0}">{1}</p>\n'.format(class_name, line.rstrip())
        # write line to file

The results you are getting do not look to be caused by the regular expressions as they appear to be correct. The problem is most likely in the line where you do your encoding / converting. Print that line without adding the tags to see if it is as expected.
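One way to do that check is to look at `repr()` of a line, which makes invisible characters show up. The following is a minimal sketch, assuming the input file has Windows CRLF line endings and is read without newline translation (`codecs.open` opens the underlying file in binary mode, so each line keeps its `\r\n`). The sample string is made up for illustration.

```python
import re

# Hypothetical sample: one line as read from a CRLF file without
# newline translation, so the line still ends in "\r\n".
line = u"text\r\n"

# repr() makes the otherwise invisible carriage return visible.
print(repr(line))

# "$" matches just before the trailing "\n", so with a single substitution
# the closing tag lands between "\r" and "\n".
tagged = re.sub(r"$", u"</p>", re.sub(r"^", u"<p>", line, count=1), count=1)
print(repr(tagged))

# Stripping the line ending first and adding a plain "\n" back avoids this.
fixed = u"<p>{0}</p>\n".format(line.rstrip(u"\r\n"))
print(repr(fixed))
```

If this matches what happens in the question, it would explain the symptoms: the regular expression itself behaves as documented, but the stray `\r` before `</p>` sends a console cursor back to the start of the line, so the closing tag is drawn over the first characters. Writing the text in text mode on Windows then turns each `\n` into `\r\n`, which adds one more CR.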

Gandaro
Rob Wouters
  • it will leave a newline before `</p>` – jfs Jan 22 '12 at 15:27
  • @J.F.Sebastian, nice catch. Added `rstrip` to the answer. – Rob Wouters Jan 22 '12 at 15:30
  • `.rstrip('\n\r')` will preserve `' \t'` at EOL. – jfs Jan 22 '12 at 15:34
  • @J.F.Sebastian, I thought about that but trailing whitespace in a `<p>` didn't seem very useful. If the OP still wants it, your suggestion should do it. – Rob Wouters Jan 22 '12 at 15:39
  • @Rob I did it with `'<p>{0}</p>'.format(line)` but with `with codecs.open(file, encoding='utf-8') as f:`, and the result is: CR+CR+LF before the closing tag `</p>`. Strange, because I see it in one string as `'<p>{0}</p>'` –  Jan 22 '12 at 15:54
  • @Tib, did you use my original method without rstrip? If so, try my edited answer with rstrip() – Rob Wouters Jan 22 '12 at 15:56
  • @RobWouters It works with rstrip, examined with a hex editor, thanks. (Does `.format` work the same way in Python 2.7 as in Python 3?) But how do I handle the 2 types of tag: line1-2 `<p class='aaa'>` and line4-5 with `<p class='bbb'>`? –  Jan 22 '12 at 16:06
  • @Tib, there might be some differences but they won't matter here. I've edited the answer. – Rob Wouters Jan 22 '12 at 16:08
  • @RobWouters Not working, there is some catch with the `if`, because every line has `<p class="aaa">`. Indentation checked, your typo (classname/class_name) corrected. –  Jan 22 '12 at 16:30
  • @RobWouters: I don't understand why the "if" line doesn't find the empty line... I used `f=codecs.open(file, encoding='utf-8')` instead of `with open('utf8.txt') as f:` –  Jan 23 '12 at 00:05
  • @Tib, it's possible the line contains spaces. You could try `if not line.strip():` – Rob Wouters Jan 23 '12 at 00:22
  • @RobWouters Thanks, with `strip()` the `if` condition worked and it writes `class=bbb`, but the empty line disappeared from the result... Possibly it is not the concatenation but the `for line in f:` that adds the extra x0D byte to the `line` variable? –  Jan 23 '12 at 12:45
  • @RobWouters: I tested my question. No, not the `for` adds extra byte. But there is no space in the `line` variable while the `if` testing it... Interesting. Anyway I added before `else` the following: `text+="\n"` and now it's fully working. You may add this to your sourcecode. –  Jan 23 '12 at 13:01
  • @Gandaro: if you edited last time the answer-source: `if line=="\n":` not working, but `if line=="\r\n":` works (but no empty line). –  Jan 23 '12 at 13:08
  • @Tib Well I am using a Unix-like system which uses '\n' for line breaks. – Gandaro Jan 23 '12 at 13:24
#!/usr/bin/env python
import cgi
import fileinput
import os
import shutil
import sys

def textfiles(rootdir, extensions=('.txt',)):
    # yield the path of every text file under rootdir
    for dirpath, dirs, files in os.walk(rootdir):
        for f in files:
            if f.lower().endswith(extensions):
                yield os.path.join(dirpath, f)

def htmlfiles(files):
    # copy each text file to a .html file and yield the new path
    for f in files:
        root, _ = os.path.splitext(f)
        newf = root + '.html'
        shutil.copy2(f, newf)
        yield newf

# inplace=True redirects print() output into the file being read
for line in fileinput.input(htmlfiles(textfiles(sys.argv[1])), inplace=True):
    if fileinput.isfirstline():
        klass = 'aaa' # start head part
    line = cgi.escape(line.decode('utf-8').strip())
    line = line.encode('ascii', 'xmlcharrefreplace')
    if not line: # empty line
        klass = 'bbb' # start tail part
        print(line)
    else:
        print('<p class="%s">%s</p>' % (klass, line))

Example

$ python txt2html.py c:\root\dir
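As an alternative that avoids copying each file and then rewriting it in place, the whole task could also be sketched with `io.open`, which exists in both Python 2.7 and Python 3. This is only a sketch under assumptions: the helper names (`to_entities`, `convert_lines`, `convert_file`, `convert_folder`) are made up for illustration, it looks only at the top level of the folder instead of walking subdirectories, it writes `.htm` files next to the sources, and it uses a named entity where the standard entity table has one and a numeric reference otherwise.

```python
import io
import os

try:  # Python 3
    from html import escape
    from html.entities import codepoint2name
except ImportError:  # Python 2.7
    from cgi import escape
    from htmlentitydefs import codepoint2name


def to_entities(text):
    """Escape markup characters, then replace non-ASCII characters with entities."""
    out = []
    for ch in escape(text, True):
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif code in codepoint2name:
            out.append(u"&{0};".format(codepoint2name[code]))  # named entity
        else:
            out.append(u"&#{0};".format(code))  # numeric reference
    return u"".join(out)


def convert_lines(lines):
    """Yield tagged lines; the class switches from 'aaa' to 'bbb' at a blank line."""
    klass = u"aaa"
    for line in lines:
        line = line.strip()
        if not line:
            klass = u"bbb"
            yield u""  # keep the separating empty line
        else:
            yield u"<p class='{0}'>{1}</p>".format(klass, to_entities(line))


def convert_file(src, dst):
    # In text mode io.open decodes UTF-8 and, with its default newline
    # handling, turns "\r\n" into "\n" on reading and writes the platform's
    # own line ending for each "\n" on output.
    with io.open(src, encoding="utf-8") as fin:
        with io.open(dst, "w", encoding="ascii") as fout:
            for out_line in convert_lines(fin):
                fout.write(out_line + u"\n")


def convert_folder(folder):
    """Convert every *.txt file directly inside folder to a *.htm file."""
    for name in sorted(os.listdir(folder)):
        root, ext = os.path.splitext(name)
        if ext.lower() == ".txt":
            convert_file(os.path.join(folder, name),
                         os.path.join(folder, root + ".htm"))
```

Calling `convert_folder(sys.argv[1])` from a small script would mirror the command line above. Because line endings are stripped on input and a single `\n` is written per output line, each file is read once and written once, and the result should not contain a doubled CR.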
jfs
  • Added `import sys`. Now works, but only prints lines out, and I'd like it written out to *.htm text file(s). Is there a `fileoutput` also like `fileinput`? –  Jan 22 '12 at 16:35
  • @Tib: there are multiple options e.g., you could wrap `textfiles()` to copy each `'.txt'` file with `shutil.copy2()` and then yield `'.html'` filenames to fileinput (use `inplace=True` in this case). Or close/open new file inside `if fileinput.isfirstline()`. – jfs Jan 22 '12 at 16:45
  • I have to investigate and learn the docs because I did not understand totally what you wrote :-) Remember, I just started with python :-) –  Jan 22 '12 at 16:58
  • @Tib: I've added `htmlfiles()` function to illustrate the previous comment. Note: the data is read/written twice in this case. – jfs Jan 22 '12 at 17:11
  • @JF: Thanks, it works, I accept it as a solution. But what do you mean "data is read/written twice"? Once read, and once write, or else? –  Jan 22 '12 at 22:50
  • @Tib: `shutil.copy` reads `.txt` file, writes `.html` file; `fileinput.input()` reads `.html`, writes `.html`: 4 times in total. – jfs Jan 23 '12 at 03:25