This is a real situation I ran into: I need to extract the information for every book from a text. In the original text, each book's information is separated from other text by a newline.
Every book has a title, but the other fields (author, format, ...) can be omitted. When they are present, fields can be separated from each other by a newline or by whitespace. The hardest part for me is that the fields can appear in arbitrary order. Here is an example:
title: book 1
author: Mike
Language: Eng
format: pdf
pages: 12
some other text
author: Jack
title: book 2
some other text 2
title: book 3 pages: 300
This should be recognized as 3 books. My desired Python code would look something like this:
# Desired usage (pseudocode: the pattern does not work yet, and re.findall returns tuples, not dicts)
for item in re.findall("title: .{1,} ((?=.*Author: .{1,}) ){0,1} ((?=.*language: .{1,}) ){0,1} ((?=.*format: .{1,}) ){0,1}. ((?=.*pages: .{1,}) ){0,1} \n", subject, re.IGNORECASE | re.VERBOSE):
    print(item['Author'] or 'Unknown author')  # fall back only when the field is missing
    print(item['title'])
    print(item['pages'] or 'Unknown pages')
    print('\n')
# what I expect it to print:
Mike
book 1
12
Jack
book 2
Unknown pages
Unknown author
book 3
300
Please note 2 things:
1. For book 2, the author comes before the title in the text, which is what I mean by arbitrary order.
2. For book 3, the page information is not on a new line. Since the field labels (author:, title:, and so on; sorry, I don't know the proper English term) never appear inside another field's value, it is safe to say that the title is "book 3" and that "pages: 300" is a separate field, not part of the title.
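To make the two notes above concrete, here is a minimal sketch of the parsing rules as I understand them. It is not a definitive solution. It assumes the five labels shown in the example are the only ones, that any line without a label (including a blank line) ends the current book, and that a repeated label starts a new book. The names `LABELS`, `FIELD_RE`, and `parse_books` are made up for illustration.

```python
import re

# Assumption: these are the only field labels that occur in the text.
LABELS = ("title", "author", "language", "format", "pages")
_label = "|".join(LABELS)

# One field is "label: value". The value is matched lazily and stops
# right before the next label on the same line, or at the end of the line.
FIELD_RE = re.compile(
    rf"\b({_label}):\s*(.*?)\s*(?=\b(?:{_label}):|$)",
    re.IGNORECASE,
)

def parse_books(text):
    """Group label/value pairs into one dict per book, in any field order."""
    books, current = [], {}
    for line in text.splitlines():
        fields = FIELD_RE.findall(line)
        if not fields:
            # A line without any label separates two books.
            if current:
                books.append(current)
                current = {}
            continue
        for label, value in fields:
            key = label.lower()
            if key in current:
                # Assumption: a repeated label means a new book has started.
                books.append(current)
                current = {}
            current[key] = value
    if current:
        books.append(current)
    # Every book has a title, so drop groups without one.
    return [book for book in books if "title" in book]

sample = """
title: book 1
author: Mike
Language: Eng
format: pdf
pages: 12
some other text
author: Jack
title: book 2
some other text 2
title: book 3 pages: 300
"""

for book in parse_books(sample):
    print(book.get("author", "Unknown author"))
    print(book["title"])
    print(book.get("pages", "Unknown pages"))
```

The lookahead is used here only to decide where a value ends, so "book 3 pages: 300" splits into a title and a page count, while the grouping into books is done in plain Python rather than inside one large pattern.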
I have read the question "Regex: I want this AND that AND that... in any order" and imitated its approach to arrive at the regular expression above. But as you can see, it does not work:
import re
subject = '''
title: book 1
author: Mike
Language: Eng
format: pdf
pages: 12
some other text
author: Jack
title: book 2
some other text 2
title: book 3 pages: 300
'''
result = re.findall("title: .{1,} ((?=.*Author: .{1,}) ){0,1} ((?=.*language: .{1,}) ){0,1} ((?=.*format: .{1,}) ){0,1}. ((?=.*pages: .{1,}) ){0,1} \n", subject, re.IGNORECASE | re.VERBOSE)
for i in result:
    print(i)
produces
('', '', '', '')
('', '', '', '')
('', '', '', '')
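As far as I can tell, two things contribute to the empty strings. A lookahead is zero-width, so a group that wraps only `(?=...)` consumes nothing and captures `''`. And with `re.VERBOSE`, the literal spaces and the newline in the pattern are ignored. The short sketch below shows both effects in isolation; the sample string `line` and the simplified patterns are illustrative assumptions, not the pattern from above.

```python
import re

line = "title: book 1 author: Mike"

# A group that contains only a lookahead matches at a position, not a span,
# so it succeeds but captures the empty string.
zero_width = re.search(r"title: ((?=.*author: \w+))", line)
print(repr(zero_width.group(1)))  # ''

# Moving the capture group inside the lookahead keeps the looked-at text.
inside = re.search(r"title: (?=.*author: (\w+))", line)
print(inside.group(1))  # Mike

# Under re.VERBOSE, unescaped whitespace in the pattern is dropped,
# so "title: x" behaves like "title:x".
verbose_no_space = re.fullmatch("title: x", "title:x", re.VERBOSE)
verbose_with_space = re.fullmatch("title: x", "title: x", re.VERBOSE)
print(verbose_no_space is not None)  # True
print(verbose_with_space is None)    # True
```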
Any help would be appreciated. Thanks!