Python Regex - Identifying the first and last items in a list

Question

I need to transform some text files into HTML code. I'm stuck in transforming a list into an HTML unordered list. Example source:

some text in the document
* item 1
* item 2
* item 3
some other text

The output should be:

some text in the document
<ul>
    <li>item 1</li>
    <li>item 2</li>
    <li>item 3</li>
</ul>
some other text

Currently, I have this:

import re

r = re.compile(r'\*(.*)\n')
r.sub(r'<li>\1</li>\n', the_text_document)  # raw string so \1 is a group reference; re-adds the consumed newline

which creates an HTML list without <ul> tags.
How can I identify the first and last items and surround them with <ul> tags?

Just iterate over the document line by line, and check the regex. Whenever you successfully match, start a new <ul>, and whenever you stop matching, put in a </ul>. — Guy Adini, Jul 08 '12 at 14:41
Thanks for the answer. Since I perform a series of various regex replacements on the document, I'd prefer to use a regex for this situation as well. However, if I can't find one, this would probably be the solution. — user1102018, Jul 08 '12 at 14:47

Levon · Answer 1 · 2012-07-08T18:06:19.463

You could just process your data line by line. The quick and dirty solution below could probably be tidied up, but for your data it does the trick.

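with open('data.txt') as inf:
    in_list = False  # True while we are inside a run of '*' lines
    for line in inf:
        line = line.strip()

        if not line.startswith('*'):
            if in_list:
                print('</ul>')  # the list just ended
                in_list = False
            print(line)
        else:
            if not in_list:
                print('<ul>')  # first item of a new list
                in_list = True
            # drop the leading '*' and the surrounding whitespace
            print('  <li>%s</li>' % line[1:].strip())
    if in_list:
        print('</ul>')  # close a list that runs to the end of the file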

yields:

some text in the document
<ul>
  <li>item 1</li>
  <li>item 2</li>
  <li>item 3</li>
</ul>
some other text

Depending on how complex your data is (for example, whether it contains several or nested unnumbered lists), this may require modification, and you may want to look for a more general solution or adapt this starter code to fit your needs. Only you can decide.

Update:

Edited the <li> .. </li> print line to get rid of the * that was previously left in.
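
For documents with several lists, one possible generalization of the line-by-line idea is to group consecutive lines with itertools.groupby. This is only a minimal sketch, assuming every list item starts with `*` at the beginning of its line; the function name convert_lines is made up for illustration and is not part of the original answer.

```python
from itertools import groupby


def convert_lines(lines):
    """Yield HTML lines, wrapping each run of '*' lines in <ul>...</ul>.

    convert_lines is a hypothetical helper name used only for this sketch.
    """
    stripped = (line.strip() for line in lines)
    # groupby splits the stream into runs of list lines and runs of other lines
    for is_item, group in groupby(stripped, key=lambda s: s.startswith('*')):
        if is_item:
            yield '<ul>'
            for item in group:
                # drop the leading '*' and the surrounding whitespace
                yield '  <li>%s</li>' % item[1:].strip()
            yield '</ul>'
        else:
            for text in group:
                yield text
```

Because each run of `*` lines is handled as one group, a list at the very end of the input is closed as well. Usage could look like `print('\n'.join(convert_lines(open('data.txt'))))`.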

Thanks. Actually there could be a number of lists in the document. As I answered @Guy Adini, I would probably use this solution unless I find a regex to accomplish the task. — user1102018, Jul 08 '12 at 14:57
@user1102018 You are welcome. I also just updated my answer by modifying the print line which was previously unintentionally leaving the `*` in the itemized list generated. — Levon, Jul 08 '12 at 18:07

Loïc Faure-Lacroix · Answer 2 · 2012-07-08T15:38:19.920

Or use BeautifulSoup

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

edit

I apparently have to give you some hints on how to read documentation.

Open the link
On the left there is a big menu (teal color)
If you look carefully you will notice that the documentation is divided in multiple sections
- Stuffs
- Navigation in the tree
- Searching the tree
- Modifying the tree (got it)
- Output (got it!)

And many more things

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Don't stop reading after the first sentence... The last one is pretty important, and what's in the middle too.

In other words, you can create an empty document... let's say:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div></div>", "html.parser")
document = soup.div

Then you read each line of your text, and whenever you have plain text you do this:

document.append(line)

If the line starts with a `*`:

ul = soup.new_tag('ul')  # new_tag() is a method of the BeautifulSoup object, not of a tag
document.append(ul)
document = ul

Then push all the li elements onto the document, and once you stop reading `*` lines, just pop back to the parent so that document is the div again. Keep doing that... you can even do it recursively to insert ul elements into other ul elements.

Once you parsed everything... you can do

str(document)

or

document.prettify()

Edit

Just realized that you weren't editing HTML but unformatted text. You could try using Markdown then.

http://daringfireball.net/projects/markdown/

edited Jul 08 '12 at 15:38

answered Jul 08 '12 at 14:48

Loïc Faure-Lacroix

BeautifulSoup parses text out of html, question was how to format text AS html. unless BeautifulSoup has some feature I don't know about? – Francis Yaconiello Jul 08 '12 at 14:51
The page says: 'Beautiful Soup is a Python library for pulling data out of HTML and XML files'. So as long as it doesn't do the opposite (pull data out of non-tree-structured document), how could this help me? Plus, I'd rather avoid using another library just for this task, if possible. – user1102018 Jul 08 '12 at 14:54
I'm not sure what you mean by `deceived`? I think you may be using a bad translator/translation. – Francis Yaconiello Jul 08 '12 at 14:58
BeautifulSoup is a DOM manipulator. You can build your DOM with it, and he can use that while parsing the text. For example, for every new line he adds text to the document; if he encounters `*`, he adds a `ul` and then `li` elements until he encounters a line that doesn't start with `*`, and using the DOM he can pop out of the `ul`, and so on. – Loïc Faure-Lacroix Jul 08 '12 at 15:24
Also you do NOT want to use regex to solve this problem: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Mikko Ohtamaa Jul 08 '12 at 16:05

score 1 · Accepted Answer · edited Aug 01 '14 at 10:52

After playing with some ideas, I've decided to go with a second regex. So basically, after running the first regex (from my original post, that creates the <li> tags), I run:

r = re.compile(r'(<li>.*?</li>\n(?!\s*<li>))', re.DOTALL)
r.sub('<ul>\\1</ul>', string_with_li_tags)

This will find the first match of <li> tag and the last match of </li>\n combo, not followed by a <li> tag (which essentially means the entire list) and add <ul> tags.

EDIT: I modified the regex a bit so it won't be greedy. This way it can handle multiple lists in the same document. Only requirement is that there are no spaces between list items, as @Aprillion mentioned below

EDIT 2: Modified the negative lookahead to treat spaces between list items as well, so all cases are covered
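
For reference, the two passes can be chained roughly as follows. This is only a sketch under a few assumptions: the first pattern is a slightly adapted version of the one in the question (anchored to the start of the line, and it puts back the newline it consumes), the replacement puts <ul> and </ul> on their own lines, and the names ITEM_RE, LIST_RE and lists_to_html are made up for illustration.

```python
import re

# First pass, adapted from the question: "* item" lines become <li>item</li>.
# Assumption: every list line starts with '*' at the beginning of the line.
ITEM_RE = re.compile(r'^\*[ \t]*(.*)\n', re.MULTILINE)

# Second pass, the regex from this answer: a lazy match from the first <li>
# up to a </li>\n that is not followed by another <li>.
LIST_RE = re.compile(r'(<li>.*?</li>\n(?!\s*<li>))', re.DOTALL)


def lists_to_html(text):
    """Hypothetical helper: build the <li> lines, then wrap each run in <ul>."""
    with_items = ITEM_RE.sub(r'<li>\1</li>\n', text)
    return LIST_RE.sub(r'<ul>\n\1</ul>\n', with_items)


print(lists_to_html("intro\n* a\n* b\nmiddle\n* c\nend\n"))
```

Note that because the lookahead skips whitespace, two lists separated only by blank lines end up in a single <ul>; whether that is acceptable depends on the documents being converted.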

edited Aug 01 '14 at 10:52

Aprillion

answered Jul 08 '12 at 17:16

user1102018

It works because `.*` is greedy: it matches the whole document, then backtracks one character at a time until it matches `\n` afterward. The negative lookahead would work for any `\n` with spaces between list items, so it does no good in your regex. – Aprillion Jul 08 '12 at 17:25

@deathApril, I tried also with spaces between list items, like so: "some text\n item 1 \n item 2 \n item 3 \nsome more text", and it works as well (it doesn't display well, but there are spaces after each `\n`) – user1102018 Jul 08 '12 at 18:12
