2

I'm working with a txt or htm file. Currently I go through the document char by char using a for loop, but I need to go through the text word by word, and then through each word char by char. How can I do this?

for (int i = 0; i < text.Length; i++)
{}
John Saunders
Hurrem
  • You need a way of delimiting words within your file. Whitespace would potentially work, but I can see issues with punctuation etc. – DGibbs Mar 05 '13 at 17:05
  • Use regular expressions to match on a pattern which represents a word. Then search the match char by char. – Alan Mar 05 '13 at 17:06
  • What do you class as a word? Specifically when looking at an HTML file? – Ash Burlaczenko Mar 05 '13 at 17:06
  • Yes, that's what I was thinking about whitespace, but it gets harder when working with HTML files. So I thought maybe somebody has a better solution. – Hurrem Mar 05 '13 at 17:07
  • Define "word". What about hyphens, apostrophes and other non-letters? – Jodrell Mar 05 '13 at 17:08
  • @Alan I wouldn't recommend [parsing a .htm file with a regular expression](http://stackoverflow.com/a/1732454/1895201). – DGibbs Mar 05 '13 at 17:08
  • read all the file then use split with rules you choose, you may need some regular expressions. – Alaa Jabre Mar 05 '13 at 17:08
  • This is called Tokenization: http://en.wikipedia.org/wiki/Tokenization. It can be a complex subject, depending on your source input. – Polyfun Mar 05 '13 at 17:08
  • A word is a regular word that we use to express our mind :) So I want to get the text word by word without those tags. I'm not searching, I just convert the text in that file to something else. – Hurrem Mar 05 '13 at 17:09
  • If you want to get the text content of an html stream you will need an HTML parser, don't try and do this with a regex. – Jodrell Mar 05 '13 at 17:10
  • http://htmlagilitypack.codeplex.com/ is a good HTML parser to use with .NET. – Jodrell Mar 05 '13 at 17:13
  • @DGibbs Why not? He also said he wants to parse a text file, which a regular expression should be fine for. Depending on how he defines a word, it may work for both. He's not trying to parse a HTM file, he is trying to search for characters in a "word" These are different things – Alan Mar 05 '13 at 17:24
  • @Alan It would probably work fine for a text file, but I think it's safe to assume that his .htm file contains HTML markup, which would become very tricky to parse with a regular expression. – DGibbs Mar 05 '13 at 17:25

8 Answers

5

A simple approach is to use string.Split without arguments (it splits on white-space characters):

using (StreamReader sr = new StreamReader(path)) 
{
    while (sr.Peek() >= 0) 
    {
        string line = sr.ReadLine();
        string[] words = line.Split();
        foreach(string word in words)
        {
            foreach(Char c in word)
            {
                // ...
            }
        }
    }
}

I've used StreamReader.ReadLine to read the file one line at a time.

To parse HTML I would use a robust library like HtmlAgilityPack.

carla
Tim Schmelter
2

You can split the string on whitespace, but you will have to deal with punctuation and HTML markup (you said you were working with txt and htm files).

string[] tokens = text.Split(); // with no arguments, Split() splits on white space
foreach(string tok in tokens)
{
    // process tok string here
}
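
One possible way to deal with the punctuation mentioned above, as a minimal sketch only: split on white space, drop empty entries, and trim punctuation from both ends of each token. The names `WordSplitter`, `SplitWords` and `TrimPunctuation` are hypothetical, and treating every `char.IsPunctuation` character at the edge of a token as noise is an assumption that will not fit every definition of a word. HTML markup is not handled here.

```csharp
using System;
using System.Collections.Generic;

public static class WordSplitter
{
    // Hypothetical helper: returns the whitespace-delimited tokens of text,
    // with leading and trailing punctuation removed from each one.
    public static List<string> SplitWords(string text)
    {
        var words = new List<string>();

        // A null separator array makes Split use white space as the delimiter;
        // RemoveEmptyEntries avoids empty tokens between consecutive spaces.
        string[] tokens = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        foreach (string tok in tokens)
        {
            string word = TrimPunctuation(tok);
            if (word.Length > 0)
                words.Add(word);
        }
        return words;
    }

    // Strips punctuation from both ends; inner characters such as the
    // apostrophe in "don't" are left untouched.
    private static string TrimPunctuation(string token)
    {
        int start = 0;
        int end = token.Length;
        while (start < end && char.IsPunctuation(token[start])) start++;
        while (end > start && char.IsPunctuation(token[end - 1])) end--;
        return token.Substring(start, end - start);
    }
}
```

Each returned word can then be walked char by char with a nested `foreach (char c in word)`.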
toby
1

Here's my implementation of a lazy extension method for StreamReader. The idea is to avoid loading the entire file into memory, especially if your file is a single long line.

public static string ReadWord(this StreamReader stream, Encoding encoding)
{
    string word = "";
    // read single character at a time building a word 
    // until reaching whitespace or (-1)
    while(stream.Read()
       .With(c => { // with each character . . .
            // StreamReader.Read() returns an already-decoded character
            // (or -1 at end of stream), so a cast is enough here; the
            // encoding parameter is kept only so existing callers compile
            if (c == -1 || Char.IsWhiteSpace((char)c))
                 return -1; //signal end of word
            else
                 word = word + (char)c; //append the char to our word

            return c;
    }) > -1);  // end while(stream.Read() if char returned is -1
    return word;
}

public static T With<T>(this T obj, Func<T,T> f)
{
    return f(obj);
}

To use it, simply:

using (var s = File.OpenText(file))
{
    while(!s.EndOfStream)
        s.ReadWord(Encoding.Default).ToCharArray().DoSomething();
}
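
The same lazy idea can also be written as an iterator, which may be easier to follow. This is a sketch under assumptions, not a replacement for the code above: `WordReader` and `ReadWords` are hypothetical names, a word is taken to be any run of non-whitespace characters, and runs of consecutive white space are skipped rather than returned as empty words.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WordReader
{
    // Hypothetical helper: lazily yields whitespace-delimited words from any
    // TextReader, reading one character at a time so the whole input never
    // has to be held in memory.
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        var word = new StringBuilder();
        int c;

        // Read() returns the next character as an int, or -1 at end of input
        while ((c = reader.Read()) != -1)
        {
            char chr = (char)c;
            if (char.IsWhiteSpace(chr))
            {
                // white space ends the current word, if there is one
                if (word.Length > 0)
                {
                    yield return word.ToString();
                    word.Clear();
                }
            }
            else
            {
                word.Append(chr);
            }
        }

        // emit the last word when the input does not end in white space
        if (word.Length > 0)
            yield return word.ToString();
    }
}
```

With a file, this could be used as `foreach (string word in WordReader.ReadWords(File.OpenText(file)))`, followed by an inner loop over the characters of `word`.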
K. R.
0

Use text.Split(' ') to split the text on spaces into an array of words, then iterate through that array:

foreach(String word in text.Split(' '))
   foreach(Char c in word)
      Console.WriteLine(c);
mdubez
0

You could split on spaces:

string[] words = text.Split(' ');

This gives you an array of words, which you can then iterate over.

foreach(string word in words)
{
    // do something with each word
}
Steve's a D
0

I think you can use split

         var  words = reader.ReadToEnd().Split(' ');

or use

foreach (string word in text.Split(' '))
   foreach (char c in word)   // "char" is a keyword, so it cannot be the variable name
      Console.WriteLine(c);
0

You can get all the text from some HTML with the HTMLAgilityPack. If you think this is overkill look here.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);

foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    var nodeText = node.InnerText;
}

Then you can split each node's text content into words, once you define what a word is.

Maybe like this,

using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static IEnumerable<string> WordsInHtml(string text)
{
    // split on a separator character together with any adjacent non-letters
    var splitter = new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*");

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(text);

    foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
    {
        foreach(var word in splitter.Split(node.InnerText))
        {
            yield return word;
        }
    }
}

Then, to examine the chars in each word:

foreach(var word in WordsInHtml(text))
{
    foreach(var c in word)
    {
        // enumeration by word, then by char
    }
}
Community
Jodrell
0

What about regular expressions?

using System;
using System.Linq;
using System.Text.RegularExpressions;

namespace ConsoleApplication58
{
    class Program
    {
        static void Main()
        {
            string input =
                @"I'm working with a txt or htm file. And currently I'm looking up the document char by char, using for loop, but I need to look up the text word by word, and then inside the word char by char. How can I do this?";
            var list = from Match match in Regex.Matches(input, @"\b\S+\b")
                       select match.Value; //Get IEnumerable of words
            foreach (string s in list) 
                Console.WriteLine(s); //doing something with it
            Console.ReadKey();
        }
    }
}

It works with any delimiters, and as far as I know it's the fastest way to do it.
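
If hyphens and apostrophes should stay inside a word (a question raised in the comments), a stricter pattern is one option. The following is only a sketch: `WordMatcher` and `MatchWords` are hypothetical names, and the pattern encodes one assumed definition of a word, namely runs of Unicode letters or digits that may be joined by a single apostrophe or hyphen. It is meant for plain text, not for HTML markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordMatcher
{
    // Assumed definition of a word: one or more letters/digits, optionally
    // followed by groups of (apostrophe or hyphen) plus more letters/digits.
    private static readonly Regex WordPattern =
        new Regex(@"[\p{L}\p{N}]+(?:['-][\p{L}\p{N}]+)*");

    // Hypothetical helper: returns every match of WordPattern in text, in order.
    public static List<string> MatchWords(string text)
    {
        var words = new List<string>();
        foreach (Match m in WordPattern.Matches(text))
            words.Add(m.Value);
        return words;
    }
}
```

Under this pattern, "well-known" and "isn't" each count as one word, while a standalone "--" is not a word at all.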

Psilon