
I have a Java project that reads text from PDF files. The PDFs contain tables, and when a cell's text is wider than its column, the extracted text contains a line break. For example, "This is www.google.com" becomes "This is www.goog\nle.com" (wrapped onto the next line). I need to extract this text and process it with a domain regex pattern, but the pattern does not find a proper "www.google.com" when the text has wrapped. I can't simply remove every "\n", because I might have a case such as "This is www.google.com\nwww.yahoo.com".

*This PDF file was converted from a .docx. When Java reads the .docx directly, it gets "www.google.com" fine, without the line-break issue. It happens only with the PDF.

Any thoughts? Thanks.

Luke.T

1 Answer


You could remove all line breaks first and then apply a regex, as described here, to find all URLs.
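A minimal sketch of that idea, not a definitive implementation: the class name `UrlExtractor`, the method `extractUrls`, and the deliberately simple `www.` pattern are all assumptions made for illustration. The pattern is not the regex from the linked post, and real URL matching usually needs something stricter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractor {
    // Hypothetical, simplified pattern: "www." followed by dot-separated labels.
    // It stands in for whatever domain regex the project really uses.
    private static final Pattern URL =
            Pattern.compile("www\\.[A-Za-z0-9-]+(?:\\.[A-Za-z]{2,})+");

    // Removes every line break, then collects all matches of the pattern.
    static List<String> extractUrls(String pdfText) {
        String joined = pdfText.replaceAll("\\r?\\n", "");
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(joined);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // A URL wrapped inside a table cell is rejoined before matching.
        System.out.println(extractUrls("This is www.goog\nle.com"));
        // Two URLs separated only by a line break get glued together.
        System.out.println(extractUrls("This is www.google.com\nwww.yahoo.com"));
    }
}
```

With this simple pattern, the first call yields `www.google.com`, while the second yields the single string `www.google.comwww.yahoo.com`. That is the case raised in the question, so removing line breaks alone does not separate adjacent URLs.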

Alexander Pacha
  • You shouldn't have downvoted me. Please read my question properly; I wouldn't have asked if I could simply remove the line breaks. :) – Luke.T May 26 '15 at 02:00
  • I voted your question neither up nor down. However, solving this problem with a regular expression is not a good approach, because the expression may become very complicated and unmaintainable. Just try out a simple expression, e.g. on this site: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx – Alexander Pacha May 26 '15 at 08:10