
I want to read a PDF file that contains Persian characters using iText. The text is extracted, but the characters of each word come out in reverse order, for example "ره" instead of "هر". As a workaround I split the page text on "\n" and read every line from its end, but I think there may be a better way to read this PDF. This is my code:

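// Imports assume iText 5.x (com.itextpdf), where PdfTextExtractor.getTextFromPage is static.
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import java.awt.ComponentOrientation;
import java.awt.Dimension;
import java.awt.Toolkit;
import java.io.File;
import java.io.IOException;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;
import javax.swing.WindowConstants;

public class Main extends JFrame {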
    private static final int WIDTH = 600;
    private static final int HEIGHT = 600;
    /**
     * by Shomeis
     */
    private static final long serialVersionUID = 1L;

    public Main() {
        Dimension dim = Toolkit.getDefaultToolkit().getScreenSize();
        int x = dim.width / 2 - WIDTH / 2;
        int y = dim.height / 2 - HEIGHT / 2;
        setBounds(x, y, WIDTH, HEIGHT);
        setDefaultCloseOperation(WindowConstants.EXIT_ON_CLOSE);
        setMinimumSize(new Dimension(600, 600));
        //
        File pdf = new File("E:\\guide1.pdf");
        if (!pdf.canRead() || !pdf.isFile()) {
            System.err.println("cannot read input file " + pdf.getAbsolutePath());
            return;
        }
        try {
            PdfReader reader = new PdfReader(pdf.getAbsolutePath());
            String page;
            StringBuilder areaText = new StringBuilder();
            System.out.println(reader.getNumberOfPages());
            for (int k = 1; k <= reader.getNumberOfPages(); k++) {
                System.out.println(k);
                page = PdfTextExtractor.getTextFromPage(reader, k);

                // Workaround: the extracted characters come out reversed, so reverse each line.
                for (String line : page.split("\n")) {
                    areaText.append(new StringBuilder(line).reverse()).append("\n");
                }
            }
            reader.close();
            JTextArea text = new JTextArea(areaText.toString());
            JScrollPane sc = new JScrollPane(text);
            // setWrapStyleWord only has an effect when line wrapping is enabled.
            text.setLineWrap(true);
            text.setWrapStyleWord(true);
            text.setComponentOrientation(ComponentOrientation.RIGHT_TO_LEFT);
            this.setContentPane(sc);
            this.setVisible(true);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    public static void main(String[] args) {
        // The constructor shows the frame itself once the PDF has been read.
        new Main();
    }
}
albciff
Shomeis
  • What is this PdfReader of yours? Does it have a way to set the character set? – Thusitha Thilina Dayaratne Aug 27 '14 at 10:53
  • "words are reverse" - some software cannot work with Persian and other RTL scripts, so they use a trick: (1) use mirrored fonts, (2) draw all text mirrored. You see the effect of (2); using Acrobat Pro, for example, you can inspect the fonts and see (1) as well. Related: [Ruby extract arabic text from PDF](http://stackoverflow.com/questions/21032994/ruby-extract-arabic-text-from-pdf/21042960#21042960) – Jongware Aug 27 '14 at 11:36
  • If your PDF is tagged (it probably isn't) then you might be able to use the tagged information for extraction. In Acrobat, go to File, Properties. On the Description tab look in the bottom left corner to see if it is tagged. Most PDFs that I run across are not tagged unfortunately. Also, see [Edit #2 on this post](http://stackoverflow.com/a/10191879/231316) for why things are backwards – Chris Haas Aug 27 '14 at 13:17

1 Answer


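You can reverse the extracted text (note that this reverses the whole string, so the order of the lines is flipped as well):

// strategy is the TextExtractionStrategy that was used for the extraction
String res = strategy.getResultantText();
res = new StringBuilder(res).reverse().toString();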
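
Reversing per line instead of over the whole text keeps the lines in their original order, and digits or Latin words inside a Persian line can be flipped back afterwards. The following is only a rough sketch, assuming the PDF stores each line in visual (left-to-right on screen) order. The class and method names (VisualLineReverser, reverseLine, reversePage) are made up for illustration and are not part of iText. It is not a full implementation of the Unicode bidirectional algorithm, so numbers with separators such as "3.14" or mixed punctuation may still come out wrong.

```java
// Hypothetical helper, not part of iText: undoes "visual order" extraction line by line.
final class VisualLineReverser {

    // True for characters that are laid out left-to-right even inside RTL text:
    // Latin letters, European digits, and Arabic/Persian digits.
    static boolean isLeftToRight(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_LEFT_TO_RIGHT
                || d == Character.DIRECTIONALITY_EUROPEAN_NUMBER
                || d == Character.DIRECTIONALITY_ARABIC_NUMBER;
    }

    // Reverses one line, then reverses every left-to-right run again
    // so that numbers and Latin words read correctly.
    static String reverseLine(String line) {
        String reversed = new StringBuilder(line).reverse().toString();
        StringBuilder out = new StringBuilder(reversed.length());
        int i = 0;
        while (i < reversed.length()) {
            int j = i;
            while (j < reversed.length() && isLeftToRight(reversed.charAt(j))) {
                j++;
            }
            if (j > i) {
                // Left-to-right run: restore its original order.
                out.append(new StringBuilder(reversed.substring(i, j)).reverse());
                i = j;
            } else {
                out.append(reversed.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // Applies reverseLine to every line of a page while keeping the line order.
    static String reversePage(String page) {
        String[] lines = page.split("\n", -1);
        StringBuilder out = new StringBuilder(page.length());
        for (int k = 0; k < lines.length; k++) {
            if (k > 0) {
                out.append('\n');
            }
            out.append(reverseLine(lines[k]));
        }
        return out.toString();
    }
}
```

With iText, the page text returned by PdfTextExtractor.getTextFromPage(reader, k) or by strategy.getResultantText() could be passed to reversePage. Whether the result is correct depends on how the particular PDF was produced, so it is worth checking against a few pages first.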
Mohsen Abasi