Avoid author name from splitting Python

Asked Dec 07 '19 at 17:13

Active Dec 07 '19 at 17:13

Viewed 128 times

I'm reading a PDF file and splitting the whole text on the delimiter ('.'), but the PDF also contains author names like this:

Similar to the work of Valenzuela et al. [1] and Zhu et al. [2], we use features like citations from citing to cited paper, citations per section, and author overlap.

and my code splits this one line into three, like this:

Similar to the work of Valenzuela et al
[1] and Zhu et al
[2], we use features like citations from citing to cited paper, citations per section, and author overlap

Here is my code to read the PDF text and split it:

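from tika import parser
import re

# Matches citation markers such as [1]; compiled but not used below
regex = re.compile(r'\[\d]')

# Extract the text content of the PDF with Tika
objFile = parser.from_file('read.pdf')
text = objFile['content']
lstString = text.strip()
# Splits on every period, including the one in "et al."
lstString = lstString.split(".")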

Can anyone help me avoid splitting on author names?

Naila Akbar


Can I ask you a hypothetical question? What if the author's name was `J. R. R. Tolkien`? – user1558604 Dec 07 '19 at 17:17
Generally you can use the "split" function of the "re" module with an appropriate regular expression that excludes some abbreviations, e.g. by look-behind matches. But it will be hard to develop rules that reliably distinguish a sentence end from an abbreviation. – Michael Butscher Dec 07 '19 at 17:22
@user1558604 exactly.. it would be disaster – Naila Akbar Dec 07 '19 at 17:30
Would this answer help? https://stackoverflow.com/questions/4576077/python-split-text-on-sentences – user1558604 Dec 07 '19 at 17:32
Nope, it does exactly the same thing. – Naila Akbar Dec 07 '19 at 17:41
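
The look-behind idea from the comments could be sketched roughly as follows. This is only a minimal illustration, not a robust sentence splitter: the `ABBREVIATIONS` tuple and the `build_splitter` and `split_sentences` helpers are hypothetical names introduced here, and the rule "a period ends a sentence only when followed by whitespace or the end of the text" is an assumption that will not hold for every document.

```python
import re

# Hypothetical list of abbreviations (written without their final period)
# that should NOT end a sentence; extend it for your own corpus.
ABBREVIATIONS = ("et al", "e.g", "i.e")


def build_splitter(abbreviations):
    # Python's re module only allows fixed-width look-behinds, so each
    # abbreviation gets its own negative look-behind and they are chained.
    lookbehinds = "".join(f"(?<!{re.escape(a)})" for a in abbreviations)
    # A period counts as a sentence end only if whitespace or the end of
    # the text follows it and none of the abbreviations precedes it.
    return re.compile(lookbehinds + r"\.(?:\s+|$)")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    pattern = build_splitter(abbreviations)
    # Drop empty pieces, e.g. the one left after the final period.
    return [part.strip() for part in pattern.split(text) if part.strip()]


sample = (
    "Similar to the work of Valenzuela et al. [1] and Zhu et al. [2], "
    "we use features like citations from citing to cited paper, "
    "citations per section, and author overlap. This is another sentence."
)
for sentence in split_sentences(sample):
    print(sentence)
```

With this sketch the sample line stays in one piece and only the period after "overlap" splits the text. Initials such as `J. R. R. Tolkien` would still be split, so the difficulty raised in the comments remains.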


0 Answers