
I extracted text from some PDF files using Tika and stored it in text files. Now I want to parse these files with the OpenNLP chunk parser, but I am unable to parse the lines because they contain special characters (square-type symbols and diacritic symbols) with no space between one word and the next. Sample lines from my text file (the symbols themselves cannot be shown here):

51.2.3  Troubleshooting DHCP Configuration  ?
62  Module 3: Point-to-Point Protocol (PPP) ?
62.1    Configuring HDLC Encapsulation  ?

So I want to get the lines as:

Troubleshooting DHCP Configuration
Module 3: Point-to-Point Protocol (PPP)
Configuring HDLC Encapsulation

Please suggest how to do this.

user2609542

2 Answers

  1. Read the file line by line.
  2. Strip the unwanted characters from each line: line = line.replaceAll("^\\d{2}(\\.\\d)* +", "").replaceAll(" +\\?$", "");
  3. Write the result to a file using FileWriter.

This assumes that the number at the beginning of each line has the format dd(.d)*, where d is one digit: two digits first, then any number of single-digit sections. Otherwise the regex has to be changed to fit your format.

Remove the cryptic symbols by appending .replaceAll("[æ╚]", ""), listing all of the offending characters inside the square brackets. Make sure the encoding is right: if you read the file as "UTF-8", you have to paste these characters in an editor where you can specify that the file is saved as "UTF-8".
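The steps above could be put together roughly as follows. This is only a sketch under some assumptions: the class name TocCleaner and the methods cleanLine and cleanFile are made up for illustration, the files are assumed to be UTF-8, and java.nio.file.Files is used in place of FileWriter so the encoding can be stated explicitly. Instead of listing each symbol, it drops everything outside printable ASCII with [^\\p{Print}], which would also remove legitimate accented letters and tabs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the names are illustrative only.
class TocCleaner {

    // Cleans one line, assuming the dd(.d)* prefix format described above.
    static String cleanLine(String line) {
        return line
                .replaceAll("^\\d{2}(\\.\\d)* +", "")  // leading number such as "51.2.3  "
                .replaceAll(" +\\?$", "")              // trailing " ?" placeholder
                .replaceAll("[^\\p{Print}]", "")       // anything outside printable ASCII
                .trim();                               // leftover leading/trailing spaces
    }

    // Reads every line of 'in' (assumed UTF-8), cleans it, and writes it to 'out'.
    static void cleanFile(Path in, Path out) throws IOException {
        List<String> cleaned = new ArrayList<>();
        for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
            cleaned.add(cleanLine(line));
        }
        Files.write(out, cleaned, StandardCharsets.UTF_8);
    }
}
```

For example, TocCleaner.cleanLine("62  Module 3: Point-to-Point Protocol (PPP) ?") gives "Module 3: Point-to-Point Protocol (PPP)". If the stray symbols sit between two words with no space, removing them will join those words, so replacing them with a space may suit your data better.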

Marc von Renteln
  • Hi, my lines are not in any particular format, so I cannot write a regex. Is there any other solution? – user2609542 Jul 23 '13 at 08:36
  • You can still use a regex if there is no specific format but there are specific characters. To remove all non-printable characters use `replaceAll("[^\\p{Print}]", "")`. To remove specific characters, use the replace method above and list the characters. You can even remove everything that is not in A-Za-z0-9_ with `replaceAll("[\\W]", "")`. – Marc von Renteln Jul 23 '13 at 08:47

Would replacing all non-word characters with whitespace be enough, or at least a step in the right direction?

str = str.replaceAll("\\W+", " ");
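To see what that one-liner does to the sample lines, here is a small sketch. The class NonWordDemo and the method squash are hypothetical names, and the trim() call is an addition to drop the space left at the end. Note that \\W in Java is ASCII-based by default, so punctuation such as ":", "-" and "()" is replaced along with the stray symbols, and the leading section numbers are kept.

```java
// Hypothetical demo class; the names are illustrative only.
class NonWordDemo {

    // Replaces each run of non-word characters (anything outside [A-Za-z0-9_])
    // with a single space, then trims the ends.
    static String squash(String str) {
        return str.replaceAll("\\W+", " ").trim();
    }
}
```

With this, squash("62  Module 3: Point-to-Point Protocol (PPP) ?") yields "62 Module 3 Point to Point Protocol PPP", so it may be a step in the right direction but is not the exact output asked for.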
Joni