I'm reading a PDF file and splitting the whole text on the delimiter '.', but the PDF also contains author names, like this:
Similar to the work of Valenzuela et al. [1] and Zhu et al. [2], we use features like citations from citing to cited paper, citations per section, and author overlap.
and my code splits this one line into three, like this:
- Similar to the work of Valenzuela et al
- [1] and Zhu et al
- [2], we use features like citations from citing to cited paper, citations per section, and author overlap
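The behaviour can be reproduced without the PDF. This is a minimal sketch, assuming the sentence above is stored in a variable called `sample` (a name introduced here only for illustration):

```python
# Minimal reproduction of the problem; no PDF or Tika needed.
# `sample` is a hypothetical variable holding the sentence quoted above.
sample = (
    "Similar to the work of Valenzuela et al. [1] and Zhu et al. [2], "
    "we use features like citations from citing to cited paper, "
    "citations per section, and author overlap."
)

# str.split(".") cuts at every period, including the one in "et al."
parts = [p.strip() for p in sample.split(".") if p.strip()]

for p in parts:
    print("-", p)
```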
Here is my code to read the PDF text and split it:
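from tika import parser
import re

# Pattern for citation markers such as [1]; compiled but not used yet
regex = re.compile(r'\[\d]')

# Extract the plain text of the PDF with Tika
objFile = parser.from_file('read.pdf')
text = objFile['content']

# Split the text on every period, including the one in "et al."
lstString = text.strip()
lstString = lstString.split(".")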
Can anyone help me avoid splitting on the author names?
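One possible direction, offered as a hedged sketch rather than a tested solution: use `re.split` with a negative lookbehind so that a period directly after "et al" is not treated as a sentence end. The helper name `split_sentences`, the pattern name `SENTENCE_END`, and the second sentence in `sample` are assumptions made up for illustration. Other abbreviations ("Fig.", "i.e.") and decimal numbers would still be split and would need similar handling.

```python
import re

# Match a period (plus any trailing whitespace) unless it directly follows "et al".
# The lookbehind has a fixed width, which Python's re module requires.
SENTENCE_END = re.compile(r'(?<!et al)\.\s*')

def split_sentences(text):
    """Hypothetical helper: split on periods while keeping 'et al.' intact."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]

# Sample text: the sentence from the question plus one invented sentence.
sample = (
    "Similar to the work of Valenzuela et al. [1] and Zhu et al. [2], "
    "we use features like citations from citing to cited paper, "
    "citations per section, and author overlap. "
    "Author names should stay intact."
)

sentences = split_sentences(sample)
for s in sentences:
    print("-", s)
```

With the Tika output from the code above, the same helper could be called as `split_sentences(text)` in place of `text.split(".")`.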