I just used iTextSharp to extract all the text from a PDF, and now I need to split that text into words. I used to use the Acrobat library, which divided it into words automatically (using getPageNthWord()).

I don't know which criteria it used, but now I need to know how to split the text into words myself. The text will be in different languages, so I need to split on every possible separator char.

I saw the method Char.IsSeparator(), but using it means looping over every char, which will be inefficient.

What I've got so far is manually specifying the separators to use in .Split():

separators = " .,;:-(){}[]/\'""?¿!¡" & Convert.ToChar(9) & NewLine()

Is there some place to retrieve the common separator chars?

SysDragon
  • 3
    At least in Western languages, the separator for words is " ". You might also add some punctuation signs (".", ",", ":", ";"), just to account for any scenario (wrongly written text), but I don't think that you should consider more than that. Otherwise you might start to "over-separate"; for example, "-" (or `"'"` or...) does not necessarily indicate two different words. – varocarbas Oct 08 '13 at 07:53
  • 1
    First, try to look at the sample at http://msdn.microsoft.com/en-us/library/cta536cf.aspx. Second, maybe string.Split(null) will be satisfactory? – Vladimir Oct 08 '13 at 08:32
  • @VladimirFrolov Both comments should be answers, IMHO. `.Split(null)` [only uses white-space separators](http://msdn.microsoft.com/en-us/library/b873y76a.aspx), but as you pointed out, it is quite similar to the `Char.IsSeparator()` filter. – SysDragon Oct 08 '13 at 08:45
  • @varocarbas Shouldn't I use the new line and tab chars on the split? – SysDragon Oct 08 '13 at 08:46
  • 1
    The Tab is implicit in " " (or null as pointed out). Regarding the new line separator, I am not sure how you are getting it via iTextSharp (you might do a couple of quick tests to confirm it); nor am I sure whether the blank space implicitly accounts for it under your conditions (test it). But as said, in your position I would use the Split method as restrictively as possible in order to avoid unsolvable problems (treating what is really one word as two) and perform a post-analysis of the resulting words (via Replace, for example). – varocarbas Oct 08 '13 at 08:52
  • 1
    See also [How to split text into words?](http://stackoverflow.com/q/16725848) – Michael Freidgeim Aug 03 '16 at 23:35

1 Answer

You can use the string.Split method with a null parameter:

If the separator parameter is null or contains no characters, white-space characters are assumed to be the delimiters. White-space characters are defined by the Unicode standard and return true if they are passed to the Char.IsWhiteSpace method.
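
As a minimal sketch of this in VB.NET, where `Nothing` plays the role of null: the sample text and the names `SplitOnWhiteSpace`, `text` and `words` are made up for illustration, and the cast to `Char()` is only there to pick the overload unambiguously.

```vbnet
Imports System

Module SplitOnWhiteSpace
    Sub Main()
        ' Hypothetical sample text containing a space, a tab and a line break.
        Dim text As String = "Hola mundo" & Convert.ToChar(9) & "foo" &
                             Environment.NewLine & "bar"

        ' A null (Nothing) separator array means "split on Unicode white space".
        ' RemoveEmptyEntries drops the empty strings that appear between
        ' consecutive separators (for example a CR+LF line break).
        Dim words As String() = text.Split(DirectCast(Nothing, Char()),
                                           StringSplitOptions.RemoveEmptyEntries)

        For Each word As String In words
            Console.WriteLine(word)
        Next
    End Sub
End Module
```

Note that this splits on white space only, so punctuation such as "." or "," stays attached to the neighbouring word.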

Or you can follow the MSDN sample and collect all the characters for which char.IsSeparator() returns true.
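
A rough sketch of that second approach, not the MSDN sample itself: build the separator array once by scanning the whole `Char` range, then hand it to `Split`, so `Char.IsSeparator()` is not called for every character of the input. The names `SplitOnSeparators`, `separators` and `text` are illustrative. Adding tab, LF and CR by hand is an assumption about what you want, since they are control characters and `Char.IsSeparator()` does not report them.

```vbnet
Imports System
Imports System.Collections.Generic

Module SplitOnSeparators
    Sub Main()
        ' Collect every Char that Char.IsSeparator accepts.
        ' This scan runs once, regardless of how long the input text is.
        Dim separators As New List(Of Char)
        For code As Integer = 0 To Convert.ToInt32(Char.MaxValue)
            Dim c As Char = Convert.ToChar(code)
            If Char.IsSeparator(c) Then separators.Add(c)
        Next

        ' Assumption: tab, LF and CR are wanted as separators too. They are
        ' control characters, so Char.IsSeparator returns False for them.
        separators.Add(Convert.ToChar(9))
        separators.Add(Convert.ToChar(10))
        separators.Add(Convert.ToChar(13))

        ' Hypothetical sample text.
        Dim text As String = "one two" & Convert.ToChar(9) & "three"

        Dim words As String() = text.Split(separators.ToArray(),
                                           StringSplitOptions.RemoveEmptyEntries)

        For Each word As String In words
            Console.WriteLine(word)
        Next
    End Sub
End Module
```

Punctuation is not in the separator categories either, so any punctuation characters you want to split on would still have to be appended to the list explicitly.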

Vladimir