I extracted text from some PDF files using Tika and stored it in text files. Now I want to parse these files with the OpenNLP chunk parser, but I am unable to parse the lines because they contain special characters (some square-type symbols and diacritic symbols), with no space between one word and the next. Sample lines from my text file (I am unable to show the square-type and diacritic symbols here):
51.2.3 Troubleshooting DHCP Configuration ?
62 Module 3: Point-to-Point Protocol (PPP) ?
62.1 Configuring HDLC Encapsulation ?
So I want to get the lines as:
Troubleshooting DHCP Configuration
Module 3: Point-to-Point Protocol (PPP)
Configuring HDLC Encapsulation
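To make the goal concrete, here is a rough sketch of the kind of cleanup I have in mind, using only java.util.regex. It rests on several assumptions that may not hold for every file: each line starts with a dotted section number such as 51.2.3, the unwanted symbols are all outside printable ASCII, and the trailing "?" in the samples is only a placeholder for such a symbol. The class name LineCleaner and the method clean are made up for illustration and are not part of Tika or OpenNLP.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or OpenNLP.
class LineCleaner {
    // Assumption: every unwanted symbol lies outside printable ASCII (0x20-0x7E).
    private static final Pattern NON_PRINTABLE = Pattern.compile("[^\\x20-\\x7E]");
    // Assumption: a trailing '?' is a stand-in for an undisplayable symbol.
    private static final Pattern TRAILING_MARK = Pattern.compile("\\s*\\?+\\s*$");
    // Leading section number such as "62", "62.1" or "51.2.3" plus the space after it.
    private static final Pattern LEADING_NUMBER = Pattern.compile("^\\s*\\d+(?:\\.\\d+)*\\s+");
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    static String clean(String line) {
        // Replace with a space (not "") so words glued together by a symbol get separated.
        String s = NON_PRINTABLE.matcher(line).replaceAll(" ");
        s = TRAILING_MARK.matcher(s).replaceAll("");
        s = LEADING_NUMBER.matcher(s).replaceFirst("");
        // Collapse the extra spaces introduced above and trim the ends.
        return WHITESPACE_RUN.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String[] samples = {
            "51.2.3 Troubleshooting DHCP Configuration ?",
            "62 Module 3: Point-to-Point Protocol (PPP) ?",
            "62.1 Configuring\u25A1HDLC\u25A1Encapsulation \u25A1"
        };
        for (String sample : samples) {
            System.out.println(clean(sample));
        }
    }
}
```

Note that this also drops legitimate non-ASCII letters (for example accented characters) and any genuine question mark at the end of a heading, so it is only suitable if the headings are plain English text.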
Could you please suggest how to do this?