I need to check all files (especially "*.docx") in a directory of about 10 GB and collect the names of the documents that contain tables. For each file in the directory, I need to iterate through the document's elements to find out whether the opened document has a table. I need to get this done in C#. I come from a testing background but have been given a development task. Please help.

shiva
  • We are a Q&A site; see the [tour]. You are giving us requirements as if this were a freelancer platform. I recommend reading [ask] and [mre]. If you don't know where to start, it's because you haven't specified your requirements enough. Do it step by step. First [search for the documents with your extension](https://stackoverflow.com/questions/3152157/find-a-file-with-a-certain-extension-in-folder), then check their size (https://stackoverflow.com/questions/3750590), and so on. – Drag and Drop Jan 20 '21 at 09:03
  • Do it step by step, and come back when you have a specific question. I really recommend giving [ask] a try. For broad requirements like these, it will help with the specification and with finding the next step. – Drag and Drop Jan 20 '21 at 09:06
  • `using System; using System.IO; namespace proactivetable { class Program { static void Main(string[] args) { string[] files = Directory.GetFiles("D:\\Data", "*.docx", SearchOption.AllDirectories); foreach (string name in files) { Console.WriteLine(name); } } } }` I have got all of the files from the directory. The next step is to check them for tables. If a file contains a table, I should note the name of the docx separately. – shiva Jan 20 '21 at 09:20
  • And do each step independently. Don't try to look for a table in a bunch of files with the extension, etc. Take one file with only one table or a few simple tables, and use that to focus on "find table in docx C#". Always try to narrow things down into an [mre]. This part should be https://stackoverflow.com/questions/11240933/extract-table-from-docx, with `Any()`. – Drag and Drop Jan 20 '21 at 09:22
  • Could you use the [edit] button and add that information to the question? – Drag and Drop Jan 20 '21 at 09:24
  • Does this answer your question? [c# LINQ: Filter List of Files based on File Size](https://stackoverflow.com/questions/1494602/c-sharp-linq-filter-list-of-files-based-on-file-size) – Self Jan 20 '21 at 09:37
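
For the first step discussed in the comments above (collecting the files), a minimal sketch could look like the following. `DocxFinder` and `FindDocxFiles` are hypothetical names, not from the thread. Skipping names that start with `~$` is an added assumption: Word leaves such lock files next to a document while it is open, and they are not valid docx packages. `Directory.EnumerateFiles` is used instead of `Directory.GetFiles` so that paths are produced lazily rather than as one large array.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class DocxFinder
{
    // Lazily yields the paths of all .docx files under root (including subdirectories),
    // skipping Word's "~$" lock files (an assumption, see the note above).
    public static IEnumerable<string> FindDocxFiles(string root)
    {
        return Directory
            .EnumerateFiles(root, "*.docx", SearchOption.AllDirectories)
            .Where(path => !Path.GetFileName(path).StartsWith("~$", StringComparison.Ordinal));
    }
}
```

Calling `DocxFinder.FindDocxFiles(@"D:\Data")` in a `foreach` would then print or process one path at a time. Note that the enumeration can still throw `UnauthorizedAccessException` if a subdirectory is not readable.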

1 Answer

You can use the DocumentFormat.OpenXml NuGet package to open docx files and check each one for a table.

using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ConsoleApp2
{
    class Program
    {
        static void Main(string[] args)
        {
            // replace the placeholder with the directory to scan
            var files = FindFilesWithTable("<path_to_directory>");

            foreach (var file in files)
            {
                Console.WriteLine(file);
            }
        }

        static List<string> FindFilesWithTable(string directory)
        {
            // collect all docx files in the top level of the directory
            var files = Directory.GetFiles(directory, "*.docx");
            var filesWithTable = new List<string>();
            foreach (var file in files)
            {
                try
                {
                    // open file in read only mode
                    using (WordprocessingDocument doc = WordprocessingDocument.Open(file, false))
                    {
                        // check whether the body has at least one top-level table
                        var hasTable = doc.MainDocumentPart.Document.Body.Elements<Table>().Any();
                        if (hasTable)
                        {
                            filesWithTable.Add(file);
                        }
                    }
                }
                catch (Exception ex)
                {
                    // report files that cannot be opened and continue with the next one
                    Console.WriteLine("Cannot process {0}: {1}", file, ex.Message);
                }
            }
            return filesWithTable;
        }
    }
}
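
A possible variation, offered as a sketch rather than as part of the original answer: `Elements<Table>()` only looks at direct children of the body, so a table nested inside another element (for example a content control or another table's cell) would not be counted. `Descendants<Table>()` searches at any depth. `TableProbe` and `HasAnyTable` are hypothetical names, and the null checks are an added precaution for packages without a main document part.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class TableProbe
{
    // Returns true if the main document body contains a table at any nesting depth.
    // Tables that live only in headers or footers are not covered by this check.
    public static bool HasAnyTable(string path)
    {
        // open the file in read-only mode
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            var body = doc.MainDocumentPart?.Document?.Body;
            return body != null && body.Descendants<Table>().Any();
        }
    }
}
```

Inside the loop of `FindFilesWithTable`, the `using` block could then be replaced by `if (TableProbe.HasAnyTable(file)) filesWithTable.Add(file);`, with the surrounding try-catch left in place.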
user2250152
  • I would use `DirectoryInfo` instead, because it returns `FileInfo` objects, which are useful for filtering on size: `var result = new DirectoryInfo(@"c:\path").GetFiles("*.extension", SearchOption.MySearchoption).Where(f => f.Length > 10_737_418_240).Where({using(){ return hasfile}}).Select(x => x.Name)` – Drag and Drop Jan 20 '21 at 09:36
  • @DragandDrop I thought that the size of the directory is about 10 GB. – user2250152 Jan 20 '21 at 09:41
  • This line threw `System.IO.InvalidDataException: 'Central Directory corrupt'`: `using (WordprocessingDocument doc = WordprocessingDocument.Open(file, false))` – shiva Jan 20 '21 at 09:46
  • @shiva Are you accessing a local or a network directory? – user2250152 Jan 20 '21 at 09:48
  • @user2250152 Local Directory. – shiva Jan 20 '21 at 09:49
  • @shiva Time to try to isolate the issue. Create a document that is not corrupted, pass its path directly to this line, and see if you get the same error. – Drag and Drop Jan 20 '21 at 09:53
  • @DragandDrop This was the error: `IOException: An attempt was made to move the file pointer before the beginning of the file. : 'D:\Data\DefectID_SD12065_1.docx'` – shiva Jan 20 '21 at 09:56
  • @shiva You can wrap `using (word...) {}` in a try-catch block and you will see how many files cannot be processed. I've edited the answer. – user2250152 Jan 20 '21 at 10:00
  • @user2250152 The program '[8868] proactive table.exe' has exited with code 0 (0x0). The program ran without errors, but no output was produced. – shiva Jan 20 '21 at 10:07
  • @DragandDrop Thanks, man! You guys are awesome, and the replies are really quick. – shiva Jan 20 '21 at 10:13
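
To isolate the unreadable files mentioned in the comments above, one option is to test each file as a plain ZIP archive first, since a docx file is a ZIP package. The sketch below is an assumption about how that check could be done, not something posted in the thread; `PackageCheck` and `IsReadableZip` are hypothetical names. It catches the two exception types reported above (`InvalidDataException` and `IOException`). On .NET Framework, `ZipFile` needs a reference to the `System.IO.Compression.FileSystem` assembly.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class PackageCheck
{
    // Returns true if the file opens as a ZIP archive and its entry list can be read.
    // On failure, error holds the exception message so it can be logged per file.
    public static bool IsReadableZip(string path, out string error)
    {
        try
        {
            using (ZipArchive archive = ZipFile.OpenRead(path))
            {
                // Access the entry list so the central directory is actually read.
                _ = archive.Entries.Count;
                error = null;
                return true;
            }
        }
        catch (InvalidDataException ex)
        {
            // e.g. a damaged or missing central directory
            error = ex.Message;
            return false;
        }
        catch (IOException ex)
        {
            // e.g. a truncated file or a file locked by another process
            error = ex.Message;
            return false;
        }
    }
}
```

Running this over the directory before the table check gives two lists, readable and unreadable files, so a run that prints nothing can be told apart from a run where every file failed to open.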