I am working on code that scans multiple .docx files for a keyword and prints the whole sentence containing it, up to the next line break.

This works well: I get every sentence that contains the keyword, up to the line break.

My question:

What does my regex have to look like if I want the text up to the 2nd line break instead of the 1st? Maybe with the right quantifier? I couldn't get it to work.

My pattern: ".*" + "keyword" + ".*"

Main.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;
using Xceed.Words.NET;

public class Class1
{

  static void Main(string[] args)
  {
     String searchParam = @".*" + "thiskeyword" + ".*";
     List<String> docs = new List<String>();
     docs.Add(@"C:\Users\itsmemario\Desktop\project\test.docx");

     for (int i = 0; i < docs.Count; i++)
     {
         Suche s1 = new Suche(docs[i], searchParam);
         s1.SearchEngine(docs[i], searchParam);
     }
  }
}

Suche.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;
using Xceed.Words.NET;


public class Suche
{
    String path;
    String stringToSearchFor;
    List<String> searchResult = new List<String>();

    public Suche(String path, String stringToSearchFor)
    {
        this.path = path;
        this.stringToSearchFor = stringToSearchFor;
    }

    public void SearchEngine(String path, String stringToSearchFor)
    {
        using (var document = DocX.Load(path))
        {
           searchResult = document.FindUniqueByPattern(stringToSearchFor, RegexOptions.IgnoreCase);

            if (searchResult.Count != 0)
            {
                WriteList(searchResult);
            }
            else
            {
                Console.WriteLine("Text does not contain keyword!");
            }
        }
    }

    public void WriteList(List<String> list)
    {
        for (int i = 0; i < list.Count; i++)
        {
            Console.WriteLine(list[i]);
            Console.WriteLine("\n");
        }
    }
}

Expected output is like:

"*LINEBREAK* Theres nothing nicer than a working filter for keywords. *LINEBREAK*"
unsinn
  • `@".*" + "thiskeyword" + ".*\n.*"`? – Wiktor Stribiżew Mar 22 '19 at 08:11
  • Thanks, but unfortunately this is not working. I'm in C# btw, I don't really know the differences between regex flavors in different languages... – unsinn Mar 22 '19 at 08:15
  • Then please 1) add a sample text and expected output, 2) provide the *reproducible* code example. – Wiktor Stribiżew Mar 22 '19 at 08:16
  • The regex is exactly for C#. It works [like this](http://regexstorm.net/tester?p=.*thiskeyword.*%5cn.*&i=2%0d%0atext+thiskeyword+text%0d%0ahere+and%0d%0athere). If it does not for you, you must either 1) be reading the file line by line, or 2) have no linebreaks in the text. – Wiktor Stribiżew Mar 22 '19 at 08:25
  • There are line breaks, and I think it gets read line by line. I use the community version of Xceed Words (https://github.com/xceedsoftware/DocX), with the function FindUniqueByPattern. Thank you! Edit: My program says "Text does not contain keyword!" when I use your regex pattern (searchResult.Count = 0). – unsinn Mar 22 '19 at 08:29

1 Answer

You cannot use the DocX method document.FindUniqueByPattern to match across lines because it only searches within individual paragraphs. See its source code, namely the foreach( Paragraph p in Paragraphs ) loop.

You may read the document.Text property, or combine all paragraph texts into one string and search within the whole text. Remove the searchResult = document.FindUniqueByPattern(stringToSearchFor, RegexOptions.IgnoreCase); line and use:

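// Paragraph exposes its text through the Text property (capital T)
var docString = string.Join("\n", document.Paragraphs.Select(p => p.Text));
// var docString = string.Join("\n", document.Paragraphs.SelectMany(p => p.MagicText.Select(x => x.text)));
// stringToSearchFor must be the bare keyword here (e.g. "thiskeyword"), not the full ".*keyword.*" pattern, because Regex.Escape treats it as literal text
searchResult = Regex.Matches(docString, $@".*{Regex.Escape(stringToSearchFor)}.*\n.*", RegexOptions.IgnoreCase)
    .Cast<Match>()
    .Select(x => x.Value)
    .ToList();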
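
To see the difference without a .docx file at hand, here is a minimal sketch that uses a plain string array as a stand-in for the paragraph texts. The class name TwoLineSearchDemo, the helper FindWithNextLine, and the sample lines are made up for illustration and are not part of DocX. It first counts per-paragraph matches of the two-line pattern (none, since a single paragraph string contains no "\n" here) and then runs the same pattern over the joined text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TwoLineSearchDemo
{
    // Hypothetical helper: returns every line containing the keyword,
    // extended to the end of the line that follows it.
    public static List<string> FindWithNextLine(IEnumerable<string> paragraphs, string keyword)
    {
        // Join the paragraphs so that "\n" marks every paragraph boundary.
        string text = string.Join("\n", paragraphs);

        // Without RegexOptions.Singleline, "." never matches "\n", so each ".*"
        // stays on one line; the explicit "\n" is the only line break crossed.
        string pattern = ".*" + Regex.Escape(keyword) + ".*\n.*";

        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }

    public static void Main()
    {
        // Assumed sample data standing in for the paragraph texts of a document.
        var paragraphs = new[] { "first line", "text thiskeyword text", "here and", "there" };

        // Searching paragraph by paragraph: the pattern needs a "\n" that no
        // single paragraph string contains, so nothing matches.
        int perParagraph = paragraphs.Count(p => Regex.IsMatch(p, ".*thiskeyword.*\n.*"));
        Console.WriteLine(perParagraph); // prints 0

        // Searching the joined text: one match spanning two lines,
        // "text thiskeyword text" followed by "here and".
        foreach (string hit in FindWithNextLine(paragraphs, "thiskeyword"))
        {
            Console.WriteLine(hit);
        }
    }
}
```

Note that with this pattern a keyword on the very last line is not reported, because no "\n" follows it; making the trailing part optional, as in (?:\n.*)?, would be one way to include that case.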
Wiktor Stribiżew
  • I am going to try this! Have to fix my code right now cause List --> IEnumerable. It says "Extensions method must be defined in a non-Generic static class", but i cant make it static because i cant access it from my Main.cs anymore..... sry i am totally noob. But youre code looks awesome, thanks for your brainlard! – unsinn Mar 22 '19 at 09:24
  • @unsinn Add `.ToList()` at the end of the `Regex.Matches` call. – Wiktor Stribiżew Mar 22 '19 at 09:37
  • Oh that easy... thanks for this! When i use youre code it says again (searchResult.Count = 0). When i remove the '\n', it gives me all the text in the document. I think we are on the right way, but it was not exactly this! Thank you so much! It seems a bit complicated for me... – unsinn Mar 22 '19 at 10:03
  • @unsinn Please share your `document.Text` value. – Wiktor Stribiżew Mar 22 '19 at 10:05
  • document.Text value: START OF TEXTThis is the beginning of a long story, I try to get a phrase out of a docx document with the xceed words community edition "DocX"Thiskeyword is the keyword what im searching for.Somebody trying to help me and im really thank full for that. TEST thiskeywordTrying to filter words like test, and the phrase before, and after the testEND OF TEXT – unsinn Mar 22 '19 at 11:03
  • Its the whole text the DocX document contains – unsinn Mar 22 '19 at 11:04
  • @unsinn Are there linebreaks? – Wiktor Stribiżew Mar 22 '19 at 11:05
  • Yes, there are line breaks, and I think Xceed.Words.NET gets them right, because my regex worked just fine up to the next line break. It's just the .Text property that doesn't show them – unsinn Mar 22 '19 at 11:08
  • @unsinn I added two more options. – Wiktor Stribiżew Mar 22 '19 at 11:41
  • Thanks! I will try it right now, sounds like the right solution – unsinn Mar 25 '19 at 06:18
  • What is "document.Paragraphs.Select(p => p.text)" doing? I get "Paragraph does not contain a definition for text" error... Could you help me fix this? – unsinn Mar 26 '19 at 08:36
  • @unsinn That is a LINQ expression to get all paragraph texts as a "list" to `string.Join` with `"\n"`. I have the `text` property access in the code. Try updating the package if it is outdated. – Wiktor Stribiżew Mar 26 '19 at 08:40
  • my package is up to date. Im sorry but i dont get it. When i try the 2nd version of your code it works properly, but only to the first line break though. Thanks for that quick response!!! – unsinn Mar 26 '19 at 08:54
  • @unsinn This package drops all in-paragraph line breaks when reading the XML structure. This can only be fixed in the source code. Please file a bug. – Wiktor Stribiżew Mar 26 '19 at 08:58
  • Okay... sad story. Thank you so much for your help! I will report the bug. So sad i cant get it to run properly..... – unsinn Mar 26 '19 at 09:29
  • @unsinn Probably you should use the [Microsoft.Office.Interop.Word library](https://learn.microsoft.com/en-us/dotnet/api/microsoft.office.interop.word?view=word-pia) to get the text as is. – Wiktor Stribiżew Mar 26 '19 at 09:32
  • Nice tip! I hope ill find the right functions, never heard about this namespace. Lets get to work then ;P thx – unsinn Mar 26 '19 at 09:37
  • @unsinn If you are so eager, I will give you another hint: use **late binding** with that library. In C# you will now get the best performance and portability with dynamic classes. If you have any issues, let me know and I will help with this. – Wiktor Stribiżew Mar 26 '19 at 10:02
  • Awesome, thank you so much! I think theres very much to read & learn for me now before i can implement this function correctly.. getting a bit deep for my state of knowledge. I'll try my best and i'll contact you if i have any questions – unsinn Mar 26 '19 at 10:56