I need to check all files (especially "*.docx") in a directory of about 10 GB and list the names of the documents that contain tables. For each file in the directory, I need to iterate through the document's elements to find out whether the opened document has a table. I need to get this done in C#. I come from a testing background, but I have been given a development task. Please help.
-
We are a Q&A site; see the [tour]. You are giving us requirements as if this were a freelancer platform. I recommend reading [ask] and [mre]. If you don't know where to start, it's because you haven't specified your requirements enough. Do it step by step. First [search for the documents with your extension](https://stackoverflow.com/questions/3152157/find-a-file-with-a-certain-extension-in-folder), then check their size (https://stackoverflow.com/questions/3750590), and so on. – Drag and Drop Jan 20 '21 at 09:03
-
Do it step by step, and come back when you have a specific question. I really recommend giving [ask] a try. With broad requirements like these, it will help you specify them and find the next step. – Drag and Drop Jan 20 '21 at 09:06
-
`using System; using System.IO; namespace proactivetable { class Program { static void Main(string[] args) { string[] files = Directory.GetFiles("D:\\Data", "*.docx", SearchOption.AllDirectories); foreach (string name in files) { Console.WriteLine(name); } } } }` I have got all of the files from the directory. The next step is to check them for tables. If a file contains a table, I should note the name of the .docx separately. – shiva Jan 20 '21 at 09:20
-
And do each step independently. Don't try to look for a table in a bunch of files with the extension, etc. Take one file with only a table or a few simple tables, and use that to focus on "find table in docx C#". Always try to narrow things down into an [mre]. This part should be https://stackoverflow.com/questions/11240933/extract-table-from-docx, with `Any()`. – Drag and Drop Jan 20 '21 at 09:22
-
Could you use the [edit] button and add that information to the question? – Drag and Drop Jan 20 '21 at 09:24
-
Does this answer your question? [c# LINQ: Filter List of Files based on File Size](https://stackoverflow.com/questions/1494602/c-sharp-linq-filter-list-of-files-based-on-file-size) – Self Jan 20 '21 at 09:37
1 Answer
You can use the `DocumentFormat.OpenXml` NuGet package to open the .docx files and check whether each one contains a table.

    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    namespace ConsoleApp2
    {
        class Program
        {
            static void Main(string[] args)
            {
                var files = FindFilesWithTable("<path_to_directory>");
                foreach (var file in files)
                {
                    Console.WriteLine(file);
                }
            }

            static List<string> FindFilesWithTable(string directory)
            {
                // get all .docx files directly inside the directory
                var files = Directory.GetFiles(directory, "*.docx");
                var filesWithTable = new List<string>();
                foreach (var file in files)
                {
                    try
                    {
                        // open the file in read-only mode
                        using (WordprocessingDocument doc = WordprocessingDocument.Open(file, false))
                        {
                            // check whether the body directly contains at least one table
                            var hasTable = doc.MainDocumentPart.Document.Body.Elements<Table>().Any();
                            if (hasTable)
                            {
                                filesWithTable.Add(file);
                            }
                        }
                    }
                    catch (Exception ex)
                    {
                        // report files that cannot be opened (e.g. corrupt packages) and continue
                        Console.WriteLine("Cannot process {0}: {1}", file, ex.Message);
                    }
                }
                return filesWithTable;
            }
        }
    }
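
The code above looks only at the top level of one directory and only at tables that are direct children of the body. As a hedged variation (a sketch, not a tested drop-in replacement), the version below searches subdirectories too, uses `Descendants<Table>()` so that tables nested inside other elements are also counted, and collects unreadable files in a list instead of printing them. The names `TableScanner`, `HasTable`, and the `failed` parameter are hypothetical. Skipping files whose names start with `~$` rests on the assumption that these are Word's temporary lock files rather than real documents.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class TableScanner
{
    // Returns true if the document body contains a table at any nesting depth.
    public static bool HasTable(string path)
    {
        // Open read-only so the files are never modified.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            var body = doc.MainDocumentPart?.Document?.Body;
            return body != null && body.Descendants<Table>().Any();
        }
    }

    // Scans 'root' recursively. Paths of documents with tables are returned;
    // files that could not be opened are appended to 'failed' with the error message.
    public static List<string> FindFilesWithTable(string root, List<string> failed)
    {
        var matches = new List<string>();
        foreach (string file in Directory.EnumerateFiles(root, "*.docx", SearchOption.AllDirectories))
        {
            // Assumption: "~$" files are Word's temporary lock files, not documents.
            if (Path.GetFileName(file).StartsWith("~$", StringComparison.Ordinal))
                continue;

            try
            {
                if (HasTable(file))
                    matches.Add(file);
            }
            catch (Exception ex)
            {
                failed.Add(file + ": " + ex.Message);
            }
        }
        return matches;
    }
}

class Program
{
    static void Main(string[] args)
    {
        // Directory to scan: first argument, or the current directory.
        string root = args.Length > 0 ? args[0] : ".";
        var failed = new List<string>();

        foreach (string file in TableScanner.FindFilesWithTable(root, failed))
            Console.WriteLine(file);

        Console.WriteLine("Could not process {0} file(s).", failed.Count);
        foreach (string entry in failed)
            Console.WriteLine("  " + entry);
    }
}
```

Keeping the failures in a separate list makes it easy to see how many files were skipped and to test one of them in isolation.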

user2250152
-
I would use `DirectoryInfo` instead, because it returns `FileInfo` objects, which are useful for filtering on size. `var result = new DirectoryInfo(@"c:\path") .GetFiles("*.extension", SearchOption.MySearchoption) .Where(f => f.Length > 10_737_418_240) .Where({using(){ return hasfile}}) .Select(x=> x.Name)` – Drag and Drop Jan 20 '21 at 09:36
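
The `DirectoryInfo` idea in the comment above can be sketched roughly as follows. This is only an illustration under assumptions: `FileQueries`, `TotalBytes`, `SelectNames`, and the `check` parameter are hypothetical names, and the caller supplies the per-file check (for example, a table check) as a delegate, so the sketch needs nothing beyond `System.IO` and LINQ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class FileQueries
{
    // Total size in bytes of all files under 'root' that match 'pattern'.
    public static long TotalBytes(string root, string pattern)
    {
        return new DirectoryInfo(root)
            .EnumerateFiles(pattern, SearchOption.AllDirectories)
            .Sum(f => f.Length);
    }

    // Names of files that are at least 'minBytes' long and pass the caller-supplied check.
    public static List<string> SelectNames(string root, string pattern, long minBytes, Func<FileInfo, bool> check)
    {
        return new DirectoryInfo(root)
            .EnumerateFiles(pattern, SearchOption.AllDirectories)
            .Where(f => f.Length >= minBytes)
            .Where(check)
            .Select(f => f.Name)
            .ToList();
    }
}

class Program
{
    static void Main(string[] args)
    {
        // Directory to scan: first argument, or the current directory.
        string root = args.Length > 0 ? args[0] : ".";
        Console.WriteLine("Total bytes: " + FileQueries.TotalBytes(root, "*.docx"));

        // Placeholder check that accepts every file; a real table check would go here.
        foreach (string name in FileQueries.SelectNames(root, "*.docx", 0, f => true))
            Console.WriteLine(name);
    }
}
```

`FileInfo.Length` is the size of a single file, so a threshold passed as `minBytes` filters individual files, while `TotalBytes` gives the size of the whole set.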
-
@DragandDrop I thought that the size of the directory is about 10 GB. – user2250152 Jan 20 '21 at 09:41
-
The line `using (WordprocessingDocument doc = WordprocessingDocument.Open(file, false))` caused this exception: `System.IO.InvalidDataException: 'Central Directory corrupt`. – shiva Jan 20 '21 at 09:46
-
@shiva, time to try to isolate the issue. Create a document that is not corrupted, pass its path directly to this line, and see if you get the same error. – Drag and Drop Jan 20 '21 at 09:53
-
@DragandDrop This was the error: `IOException: An attempt was made to move the file pointer before the beginning of the file. : 'D:\Data\DefectID_SD12065_1.docx'` – shiva Jan 20 '21 at 09:56
-
@shiva You can wrap the `using (WordprocessingDocument ...) { }` block in a try-catch, and you will see how many files cannot be processed. I've edited the answer. – user2250152 Jan 20 '21 at 10:00
-
@user2250152 `The program '[8868] proactive table.exe' has exited with code 0 (0x0).` The program ran without errors, but no output was produced. – shiva Jan 20 '21 at 10:07
-
@DragandDrop Thanks, man! You guys are awesome and the replies are really quick. – shiva Jan 20 '21 at 10:13