C# Pointer in a HTML file

Question

The objective:

Create a marker in a HTML file for the "end" of the file in C#

I am iterating through a file in C# and taking the table values and storing them in a database. The file grows and grows everyday, the results being added in are just being appended to the report.

In C# I have a StreamReader that iterates through each file and places values in a list for insert into a SQL database.

string file_path = @"C:\This\Is\The\Way";
string[] files = Directory.GetFiles(file_path, "*.html", SearchOption.TopDirectoryOnly);
List<string> file_contents = new List<string>();
int* file_pointer; // Create my pointer
foreach(string file in files)
{
    using (StreamReader reader = new StreamReader(file))
    {
        string line = string.Empty;
        while ((line = reader.ReadLine()) != null)
        {
            // Remove <td> and </td>
            line = line.Substring(4);
            line = line.Remove(4);
            // store in a list
            file_contents.Add(line);

            // Would I leave a pointer here ? then just keep moving it after every line until the end?
            // The thought for line.Contains("</td>") is that the last line, 
            // </html> will just be moved to the end and the pointer wont show new rows. 
            // this way it goes to last </td>
            if(line.Contains("</td>")
                file_pointer += 1;
         }
    }
    foreach (string item in file_contents)
        // SQL insert, not implemented yet
    file_contents.Clear();
}

"Test" Data

<table>
<tr>
<th>Test Unique to File</th>
<th>Data Point A Unique to Test</th>
<th>Data Point B Unique to Test</th>
<th>Time Taken</th>
<th>Pass/Fail</th>
</tr>
<tr>
<td>Test A Run</td>
<td>Data Point A</td>
<td>Data Point B</td>
<td>213 seconds</td>
<td>Pass</td>
</tr>
<tr>
<td>Test B Run</td>
<td>Data Point A</td>
<td>Data Point B</td>
<td>3333 seconds</td>
<td>Fail</td>
</tr>
<tr>
<td>Test C Run</td>
<td>Data Point A</td>
<td>Data Point B</td>
<td>12 seconds</td>
<td>Pass</td>
</tr>
</table>

The goal is for once the html report is parsed & inserted into the database, a position is marked in that html file. Then say tomorrow, another 6 rows are added. Instead of iterating through the whole file and just verifying that everything about to be inserted isn't a duplicate, we can just pick up where we left off.

I'm not sure the best way to go about this and would appreciate some guidance.

Question: Would creating a pointer at the end of every be the best bet?

it would help if you provide an example of table row, maybe there is some better approach to do that — Yehor Androsov, Feb 25 '20 at 19:00
This is a very strange and inefficient storage system. Why are you appending HTML to the same file repeatedly instead of storing them as separate files? — , Feb 25 '20 at 19:01
also maybe you don't need a pointer? i mean, if you have some identifier from db in report, you could just get latest key from db, and read your file backwards, until you find your key. this way you don't need pointer at all — Yehor Androsov, Feb 25 '20 at 19:02
This code won't compile. You are confusing "memory" pointers (int* pointer) with line pointers (or indexers, int currentLine). Memory pointers will work only in unsafe context and pointer arithmetics is extremely dangerous. You'll probably want to go for int pointer (without asterisc). — Claudio Valerio, Feb 25 '20 at 19:03
@Amy I cant get into the process that makes the HTML file and change it to go straight to the database. The HTML file also I think is able to be appended incase of network crashing — Travis, Feb 25 '20 at 19:04
@pwrigshihanomoronimo I can make a sample set of a data, will edit once done. — Travis, Feb 25 '20 at 19:05
@ClaudioValerio not sure what im doing for pointers :) never used one so just flying by the seat of my pants here — Travis, Feb 25 '20 at 19:06
@Travis if you can't get into the html file generation process, it means you can't be sure it won't override your line marker, if you save it in the same file. You should save it somewhere else. Proposition: save current line marker in {filename}.llm. llm stands for Last Line Marker. In your code, when loading a {filename}.html file, try read {filename}.llm. If you find it, convert content to int and skip same amount of lines from {filename}.html. — Claudio Valerio, Feb 25 '20 at 19:11
I will look into the Last Line Marker, thanks for the idea @ClaudioValerio — Travis, Feb 25 '20 at 19:15
I would use Python (or similar and regex) for this. Regardless, look into [StreamReader and seeking](https://stackoverflow.com/a/5404324/70317) — Theofanis Pantelides, Feb 25 '20 at 19:21
Does this answer your question? [StreamReader and seeking](https://stackoverflow.com/questions/5404267/streamreader-and-seeking) — Theofanis Pantelides, Feb 25 '20 at 19:22
Looks interesting for sure @TheofanisPantelides . I'll create a test right now and see if it does! — Travis, Feb 25 '20 at 19:28
This sounds like a bad idea - surely the HTML file must reset at some point or it would grow to fill the hard drive? — NetMage, Feb 25 '20 at 19:32
@NetMage I'm not sure how the file resets. Have not been here long enough & the guys to ask are OOO — Travis, Feb 25 '20 at 19:34
I think I would recommend always parsing the entire file, and comparing the data to the database and only adding new rows. This would re-process data repeatedly but would handle resets and unexpected re-orderings without issue. — NetMage, Feb 25 '20 at 19:55

score 2 · Accepted Answer · answered Feb 25 '20 at 21:10

Unless you want to create your html parser (that is a pain in the a**, believe me), I think this could be a solution, that doesn't need line markers. I'm using HtmlAgilityPack nuget package:

Install-Package HtmlAgilityPack

Then my code would look like this:

private static void Run(string[] files)
{
    var sql = $@"
        MERGE TestsResults AS target  
        USING (SELECT @TestId, @DataPointA, @DataPointB, @Duration, @Result) AS source (TestID, DataPointA, DataPointB, Duration, Result)  
        ON (target.TestID = source.TestID)  
        WHEN MATCHED THEN
            UPDATE SET DataPointA = source.DataPointA, DataPointB = source.DataPointB, Duration = source.Duration, Result = source.Result
        WHEN NOT MATCHED THEN  
            INSERT (TestID, DataPointA, DataPointB, Duration, Result)  
            VALUES (source.TestID, source.DataPointA, source.DataPointB, source.Duration, source.Result);";
    // Get database connection
    var dbConnection = GetDbConnection();
    // Create base command from sql
    var cmd = dbConnection.CreateCommand();
    cmd.CommandType = CommandType.Text;
    cmd.CommandText = sql;
    var paramsNames = new[] { "@TestId", "@DataPointA", "@DataPointB", "@Duration", "@Result" };

    foreach (var file in files)
    {
        // Using HtmlAgilityPack to load the document
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.Load(file);
        // Getting all TR from document, skipping first with header
        var rowsCollection = doc.DocumentNode.SelectNodes("//tr").Skip(1);
        foreach (var row in rowsCollection)
        {
            // Getting all TDs in current row and converting contents to string[]
            var values = row.SelectNodes("td").Select(el => el.InnerText).ToArray();

            // insert or update in database: each row is a new DbParameter set for our DbCommand
            cmd.Parameters.Clear();
            for (int i = 0; i < paramsNames.Length; i++)
            {
                var param = cmd.CreateParameter();
                param.ParameterName = paramsNames[i];
                param.DbType = DbType.String;
                param.Value = values[i];
                cmd.Parameters.Add(param);
            }
            // Run the command, will update existing tests and adding new ones
            cmd.ExecuteNonQuery();
        }
    }
}

I'm using the power of MERGE statement for SQL Server: https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql.

Basically the MERGE statement will update matching records and inserting new ones.

Please note you could need to do some additional cleaning on the data (e.g. Duration can be converted to int, pass/fail to boolean, ...).

If you are super sure that file generator will always append rows (and will not remove or modify any of the previous worked rows), and don't want to go through already parsed lines, this is the solution:

private static void Run(string[] files)
{
    var sql = $@"
        MERGE TestsResults AS target  
        USING (SELECT @TestId, @DataPointA, @DataPointB, @Duration, @Result) AS source (TestID, DataPointA, DataPointB, Duration, Result)  
        ON (target.TestID = source.TestID)  
        WHEN MATCHED THEN
            UPDATE SET DataPointA = source.DataPointA, DataPointB = source.DataPointB, Duration = source.Duration, Result = source.Result
        WHEN NOT MATCHED THEN  
            INSERT (TestID, DataPointA, DataPointB, Duration, Result)  
            VALUES (source.TestID, source.DataPointA, source.DataPointB, source.Duration, source.Result);";
    // Get database connection
    var dbConnection = GetDbConnection();
    // Create base command from sql
    var cmd = dbConnection.CreateCommand();
    cmd.CommandType = CommandType.Text;
    cmd.CommandText = sql;
    var paramsNames = new[] { "@TestId", "@DataPointA", "@DataPointB", "@Duration", "@Result" };

    foreach (var file in files)
    {
        // Using HtmlAgilityPack to load the document
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.Load(file);
        int skipLines = 1;

        // Try to load last line marker file
        var lastLineMarkerPath = Path.ChangeExtension(file, ".llm");
        int lastWorkedLine = 0;
        if (File.Exists(lastLineMarkerPath))
        {
            var text = File.ReadAllText(lastLineMarkerPath);
            if (int.TryParse(text, out lastWorkedLine))
            {
                // if file exists and content is convertible to int, add found value to the numer of line to skip
                skipLines += lastWorkedLine;
            }
        }
        // Getting all TR from document, skipping first with header
        var rowsCollection = doc.DocumentNode.SelectNodes("//tr").Skip(skipLines);
        foreach (var row in rowsCollection)
        {
            // Getting all TDs in current row and converting contents to string[]
            var values = row.SelectNodes("td").Select(el => el.InnerText).ToArray();

            // insert or update in database: each row is a new DbParameter set for our DbCommand
            cmd.Parameters.Clear();
            for (int i = 0; i < paramsNames.Length; i++)
            {
                var param = cmd.CreateParameter();
                param.ParameterName = paramsNames[i];
                param.DbType = DbType.String;
                param.Value = values[i];
                cmd.Parameters.Add(param);
            }
            // Run the command, will update existing tests and adding new ones
            cmd.ExecuteNonQuery();
        }

        // write total row count to last line marker file
        File.WriteAllText(lastLineMarkerPath, (rowsCollection.Count() + lastWorkedLine).ToString());
    }
}

Although I would recommend you work always the whole file.

So in the first example, my table name would be `MERGE my_table AS target`, then leave source as it is? — Travis, Feb 26 '20 at 14:33
yes, substitute "TestsResults" with your actual table name. Fix other possible naming differences accordingly to your situation. — Claudio Valerio, Feb 26 '20 at 14:43

C# Pointer in a HTML file

1 Answers1