I'm reading a PDF file and splitting its whole text on the period delimiter ('.'), but the PDF also contains author names in citations, like this:

Similar to the work of Valenzuela et al. [1] and Zhu et al. [2], we use features like citations from citing to cited paper, citations per section, and author overlap.

and my code splits this one sentence into three pieces, like this:

  • Similar to the work of Valenzuela et al
  • [1] and Zhu et al
  • [2], we use features like citations from citing to cited paper, citations per section, and author overlap

Here is my code to read the PDF text and split it:

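from tika import parser
import re

# Pattern for citation markers such as [1]; compiled but not used yet
regex = re.compile(r'\[\d]')

objFile = parser.from_file('read.pdf')
text = objFile['content']
lstString = text.strip()
# Splits on every period, including the one in "et al."
lstString = lstString.split(".")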

Can anyone help me keep author names (such as "et al.") from being split?
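
One direction that might work, as a minimal sketch rather than a tested fix for this PDF: split with `re.split` and a negative lookbehind, so a period directly preceded by "et al" is not treated as a delimiter. The helper name `split_sentences` and the `ABBREVIATIONS` tuple are hypothetical and only for illustration; the list would need extending for other abbreviations that appear in the text.

```python
import re

# Assumption: abbreviations whose trailing period must not split the text.
# Only "et al" comes from the question; extend the tuple as needed.
ABBREVIATIONS = ("et al",)


def split_sentences(text):
    """Split text on '.' except when the period follows a listed abbreviation."""
    # Python's re module only allows fixed-width lookbehinds, so build one
    # negative lookbehind per abbreviation and chain them together.
    guards = "".join(f"(?<!{re.escape(abbr)})" for abbr in ABBREVIATIONS)
    pattern = re.compile(guards + r"\.")
    # Drop empty pieces, e.g. the one left after the final period.
    return [part.strip() for part in pattern.split(text) if part.strip()]


sample = (
    "Similar to the work of Valenzuela et al. [1] and Zhu et al. [2], "
    "we use features like citations from citing to cited paper, "
    "citations per section, and author overlap. Next sentence."
)
for piece in split_sentences(sample):
    print(piece)
```

With this pattern the sample sentence from the question stays in one piece, while the period after "overlap" still splits. Periods inside decimals or other abbreviations would still split, so a dedicated sentence tokenizer may be a better fit if the text has many such cases.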

Naila Akbar