
This is a real situation I ran into: I need to extract the information for every book. In the original text, each book's information is separated from the other text by a line break (ENTER).

Every book has a title, but the other fields (author, format, ...) may be omitted. When any of them is present, it can be separated from the others by a line break or by whitespace. The hardest part for me is that the fields can appear in arbitrary order. Here is an example:

title: book 1
author: Mike
Language: Eng
format: pdf
pages: 12

some other text

author: Jack
title: book 2

some other text 2

title: book 3 pages: 300

This should be recognized as 3 books. The Python code I would like to write:

for item in re.findall("title: .{1,} ((?=.*Author: .{1,}) ){0,1} ((?=.*language: .{1,}) ){0,1} ((?=.*format: .{1,}) ){0,1}. ((?=.*pages: .{1,}) ){0,1} \n", subject, re.IGNORECASE | re.VERBOSE):
    print(item['author'] or 'Unknown author')
    print(item['title'])
    print(item['pages'] or 'Unknown pages')
    print('\n')

# what I expected is
Mike
book 1
12

Jack
book 2
Unknown pages

Unknown author
book 3
300

Please note two things:

  1. For book 2, the author comes before the title in the text, which is what I mean by arbitrary order.

  2. For book 3, the page information is not on a new line. Since the headings (author:, title:, and so on; sorry, I don't know what they are called in English) never appear inside other values, it is safe to treat "pages: 300" as the page count rather than as part of the title.

I have read "Regex: I want this AND that AND that... in any order" and imitated it to arrive at the regular expression above. But as you can see, it is wrong:

import re

subject = '''
title: book 1
author: Mike
Language: Eng
format: pdf
pages: 12

some other text

author: Jack
title: book 2

some other text 2

title: book 3 pages: 300
'''


result = re.findall("title: .{1,} ((?=.*Author: .{1,}) ){0,1} ((?=.*language: .{1,}) ){0,1} ((?=.*format: .{1,}) ){0,1}. ((?=.*pages: .{1,}) ){0,1} \n", subject, re.IGNORECASE | re.VERBOSE)


for i in result:
    print(i)


produces

('', '', '', '')
('', '', '', '')
('', '', '', '')

so any help? Thanks

oyster
  • What book does `some other text` and `some other text 2` belong to? – Jarad Nov 23 '19 at 03:38
  • This problem is unsolvable as stated. You'd have to have *some* unambiguous divider between the details of one book and the next. – jasonharper Nov 23 '19 at 03:58
  • @Jarad. just skip `some other text` and `some other text 2` because there is no `title: xxx` in it. – oyster Nov 23 '19 at 05:50

2 Answers


If you don't have to use a regular expression, you could check whether ': ' appears within the first 10 characters or so of each line. If it does, assume the line is a book property. If it doesn't, the properties for the current book have ended, so you have everything for that book and can add it to a "final" list of books.

Your data as a string:

subject = '''
title: book 1
author: Mike
Language: Eng
format: pdf
pages: 12

some other text

author: Jack
title: book 2

some other text 2

title: book 3 pages: 300
'''

Some code:

books = []
book_properties = []

for line in subject.splitlines():
    # A "key: value" line has ": " within its first 10 characters.
    if ": " in line[:10]:
        book_properties.append(line)
    elif book_properties:
        # Any other line ends the current book.
        books.append(book_properties)
        book_properties = []

# Flush the last book if the text ends on a property line.
if book_properties:
    books.append(book_properties)

print(books)

Result

[['title: book 1', 'author: Mike', 'Language: Eng', 'format: pdf', 'pages: 12'],
 ['author: Jack', 'title: book 2'],
 ['title: book 3 pages: 300']]
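To get from these lists to the output asked for in the question, each line still has to be split into its fields, including lines that carry more than one field. A minimal sketch, assuming the five headings from the sample data are the only ones; `FIELD_RE` and `to_record` are hypothetical names introduced here for illustration:

```python
import re

# Assumption: these five headings (from the question's sample) are the only ones.
FIELD_RE = re.compile(r"(title|author|language|format|pages):\s*", re.IGNORECASE)

def to_record(property_lines):
    """Turn lines such as 'title: book 3 pages: 300' into a dict."""
    record = {}
    for line in property_lines:
        # re.split with one capturing group returns
        # [text_before, key1, value1, key2, value2, ...]
        parts = FIELD_RE.split(line)
        for key, value in zip(parts[1::2], parts[2::2]):
            record[key.lower()] = value.strip()
    return record

books = [
    ['title: book 1', 'author: Mike', 'Language: Eng', 'format: pdf', 'pages: 12'],
    ['author: Jack', 'title: book 2'],
    ['title: book 3 pages: 300'],
]

records = [to_record(b) for b in books]
for rec in records:
    if "title" not in rec:
        continue  # skip blocks that are not books
    print(rec.get("author", "Unknown author"))
    print(rec["title"])
    print(rec.get("pages", "Unknown pages"))
    print()
```

Splitting on the headings themselves relies on the asker's statement that a heading never appears inside another value.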
Jarad

It's a somewhat complicated mixed solution: I used regex, but not only that. I split the text into blocks and applied the regex to each line of a block.

import re

text="""

title: book 1
author: Mike
Language: Eng
format: pdf
pages: 12

some other text

author: Jack
title: book 2

some other text 2

title: book 3 pages: 300

title: Adventures of Huckleberry Finn author: Mark Twain pages: 500

title: Captain Python

"""

recs=[[]]
last=recs[-1]
for line in text.splitlines():

    line=line.strip()
    if not line:
        if not last:
            continue
        recs.append([])
        last=recs[-1]
        continue

    founds= re.findall(r"(?m)(title|author|pages):(.*?)(?:$|(?=title:|author:|pages:))",line)
    if founds and founds[0]:
        last.extend(founds)


for l in recs:
    if l:
        d={"title":"unknown", "author":"unknown", "pages":"unknown"}
        d.update( dict(l) )
        print(d) 

Output:

{'title': ' book 1', 'author': ' Mike', 'pages': ' 12'}
{'title': ' book 2', 'author': ' Jack', 'pages': 'unknown'}
{'title': ' book 3 ', 'author': 'unknown', 'pages': ' 300'}
{'title': ' Adventures of Huckleberry Finn ', 'author': ' Mark Twain ', 'pages': ' 500'}
{'title': ' Captain Python', 'author': 'unknown', 'pages': 'unknown'}
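The values above keep their surrounding whitespace, and only three headings are handled. The same block-then-regex idea can be sketched more compactly, assuming blocks are separated by blank lines and that the five headings from the question are the only ones; `KEYS`, `PAIR_RE` and `parse_books` are hypothetical names, not part of the code above:

```python
import re

# Assumption: heading names taken from the sample data in the question.
KEYS = ("title", "author", "language", "format", "pages")
PAIR_RE = re.compile(
    r"(?P<key>{0}):[ \t]*(?P<value>.*?)[ \t]*(?=(?:{0}):|$)".format("|".join(KEYS)),
    re.IGNORECASE | re.MULTILINE,
)

def parse_books(text):
    """Return one dict per blank-line-separated block that contains a title."""
    books = []
    for block in re.split(r"\n\s*\n", text):
        # Each match is one "key: value" pair; the value stops at the
        # next heading or at the end of the line.
        record = {m["key"].lower(): m["value"] for m in PAIR_RE.finditer(block)}
        if "title" in record:
            books.append(record)
    return books

sample = """
title: book 1
author: Mike
pages: 12

some other text

author: Jack
title: book 2

title: book 3 pages: 300
"""

for book in parse_books(sample):
    print(book.get("author", "Unknown author"))
    print(book["title"])
    print(book.get("pages", "Unknown pages"))
    print()
```

Blocks without a `title` are dropped, which matches the asker's comment that such text should simply be skipped.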
kantal