Java regex from pdf file read

Question

I have a Java project that reads text from PDF files. The PDFs contain tables, and when a cell's text is too long for its column, the extracted text contains a line break at the point where it wraps. E.g. "This is www.google.com" becomes "This is www.goog\nle.com" (wrapped onto the next line). I need to extract this text and process it with a domain regex pattern, but the pattern does not find a proper "www.google.com" when the text is wrapped. I can't simply remove the "\n", because I may also have cases such as "This is www.google.com\nwww.yahoo.com".

*This PDF file was converted from a .docx. When Java reads the .docx directly, it gets "www.google.com" fine, without the line-break issue. It happens only with the PDF.

Any thoughts? Thanks.

score 0 · Answer 1 · answered May 25 '15 at 15:22

You could remove all line breaks first and then apply a regex like the one described here to find all URLs.
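
A minimal sketch of that idea, assuming the PDF text has already been extracted into a String. The class name `UrlExtractor`, the method `findDomains`, and the deliberately simple `www.` pattern are hypothetical placeholders for illustration, not the regex linked above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlExtractor {
    // Hypothetical, deliberately simple pattern: "www." followed by
    // two or more dot-separated labels. Swap in your own domain regex.
    private static final Pattern DOMAIN =
            Pattern.compile("www\\.[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    static List<String> findDomains(String pdfText) {
        // Drop every line break (\r and \n) so a wrapped domain is rejoined.
        String joined = pdfText.replaceAll("[\\r\\n]+", "");

        // Collect every match of the pattern in the joined text.
        List<String> result = new ArrayList<>();
        Matcher m = DOMAIN.matcher(joined);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}
```

With this sketch, `UrlExtractor.findDomains("This is www.goog\nle.com")` finds "www.google.com". Note the limitation raised in the question: for "www.google.com\nwww.yahoo.com" the two domains are glued together and matched as a single string, so blanket removal only helps when a line break never separates two distinct domains.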

Alexander Pacha


You shouldn't have downvoted me. Please read my question properly; I wouldn't be asking if I could simply remove the line breaks. :) – Luke.T May 26 '15 at 02:00
I voted your question neither up nor down. However, solving this problem with a regular expression is not a good approach, because the expression may become very complicated and unmaintainable. Just try out a simple expression, e.g. on this site: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx – Alexander Pacha May 26 '15 at 08:10
