I need to read through a document to look for the occurrences of the word "apple". Once "apple" is found, I need to return the entire paragraph that contains this word. Is there a way to do this in C#? Can this be done at all?

Sure, returning the sentence that contains "apple" is fairly straightforward, but I am not sure what needs to be done to retrieve an entire paragraph. Do paragraphs have identifiable delimiters that I can use along with a regular expression?

To reiterate:

  • An entire PDF document needs to be analyzed.
  • When the word "apple" is found, return the paragraph that contains it.
  • Do this for each occurrence of the word "apple".
LillaTheHun
  • Do they have two newlines before and two newlines after with no non-whitespace characters between those lines? If so you could use that as a delimiter. – Spencer Wieczorek Aug 10 '18 at 23:04
  • What kind of "document"? There's countless varieties. – itsme86 Aug 10 '18 at 23:04
  • @itsme86 Left out the document type without thinking about it. Added to my post that it is a PDF – LillaTheHun Aug 10 '18 at 23:07
  • I'm not sure on an answer, but the first place I'd look is [iTextSharp](https://www.nuget.org/packages/iTextSharp/). It has some pretty cool PDF functionality. – itsme86 Aug 10 '18 at 23:11
  • Related SO Post: [Identify Paragraphs of PDF Files Using iTextSharp](https://stackoverflow.com/questions/36491429/identify-paragraphs-of-pdf-fiiles-using-itextsharp) – agillgilla Aug 10 '18 at 23:13
  • As @itsme86 said, if you haven't parsed the PDF document yet, you can use iTextSharp to do so. However, it wouldn't be a very simple task to retrieve the paragraph directly from the PDF (if that's what you're trying to do). Anyway, [here's a good place to start](https://stackoverflow.com/questions/8846653/how-to-get-the-particular-paragraph-in-pdf-file-using-itextsharp-in-c). – 41686d6564 stands w. Palestine Aug 10 '18 at 23:14

1 Answer

Sentences are usually separated by whitespace, which a regex matches with \s. Lines end with a carriage return plus line feed (CRLF), which is \r\n, or sometimes with a single line feed, \n.

Let's assume that paragraphs are separated by two (or more) CRLF (or LF). Once we have the paragraphs, we can search for any word that we want inside those paragraphs:

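    using System;
    using System.Collections.Generic;

    public static class ParagraphSearch
    {
        // Splits the text into paragraphs on blank lines (two CRLFs or two LFs).
        private static List<string> GetParagraphs(string entireText)
        {
            // String separators match the whole sequence. A char[] would split
            // on every single '\r' or '\n', i.e. on every line, not every paragraph.
            string[] separators = { "\r\n\r\n", "\n\n" };

            var paragraphs = new List<string>();
            foreach (string chunk in entireText.Split(separators, StringSplitOptions.RemoveEmptyEntries))
            {
                // Trim leftover line breaks when more than two occur in a row.
                string paragraph = chunk.Trim();
                if (paragraph.Length > 0)
                    paragraphs.Add(paragraph);
            }
            return paragraphs;
        }

        public static void Main()
        {
            string entireText = "your_text"; // text already extracted from the document
            string word = "apple";

            List<string> paragraphs = GetParagraphs(entireText);
            var containingWordList = new List<string>();
            foreach (string paragraph in paragraphs)
            {
                // Case-sensitive substring check; it also matches "pineapple".
                if (paragraph.Contains(word)) containingWordList.Add(paragraph);
            }
        }
    }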
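
Since the question asks about regular expressions, the same idea can also be sketched with Regex. This is only a sketch, assuming the text has already been extracted from the PDF and that blank lines really do mark paragraph boundaries in it (PDF extraction does not guarantee that). The names ParagraphFinder and FindParagraphs are made up for illustration. It matches the whole word only, ignoring case:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphFinder
{
    // Returns every paragraph of text that contains word as a whole word.
    public static List<string> FindParagraphs(string text, string word)
    {
        // Assumption: a paragraph break is two or more consecutive line endings
        // (CRLF or LF), optionally with spaces or tabs on the blank lines.
        string[] paragraphs = Regex.Split(text, @"(?:\r?\n[ \t]*){2,}");

        // \b anchors the match to word boundaries, so "pineapple" does not count.
        var wordPattern = new Regex(@"\b" + Regex.Escape(word) + @"\b",
                                    RegexOptions.IgnoreCase);

        var matches = new List<string>();
        foreach (string paragraph in paragraphs)
        {
            string trimmed = paragraph.Trim();
            if (trimmed.Length > 0 && wordPattern.IsMatch(trimmed))
                matches.Add(trimmed);
        }
        return matches;
    }

    public static void Main()
    {
        string text = "An apple a day.\r\nStill the first paragraph.\r\n\r\n"
                    + "No fruit here.\n\nPineapple only.\n\nApple pie.";

        // Print each matching paragraph followed by a divider line.
        foreach (string paragraph in FindParagraphs(text, "apple"))
        {
            Console.WriteLine(paragraph);
            Console.WriteLine("---");
        }
    }
}
```

With the sample text above, the first and the last paragraph are returned; "Pineapple only." is skipped because of the word-boundary check. Drop the \b anchors if partial matches should count as well.
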
Alan Deep