3

I am trying to extract text from a word document with the following format and inserting the data into a SQL database.

Word Document

Name of House: Aasleagh Lodge
Townland: Srahatloe
Near: Killary Harbour, Leenane
Status/Public Access: maintained, private fishing lodge
Date Built: 1838-1850, burnt 1923, rebuilt 1928

Source Code

        var wordApp = new Microsoft.Office.Interop.Word.Application();
        var wordDoc = wordApp.Documents.Open(@"C:\Users\mhoban\Documents\Book.docx");
        var txt = wordDoc.Content.Text;

        var regex = new Regex(@"(Name of House\: )(.+?)[\r\n]");

        var allMatches = regex.Matches(txt);
        foreach (Match match in allMatches)
        {
            var nameValue = match.Groups[2].Value;
            var townValue = match.Groups[2].Value;

            SqlConnection con = new SqlConnection(ConfigurationManager.ConnectionStrings["ConnectionString"].ToString());
            SqlCommand com = new SqlCommand();

            com.CommandText = "INSERT INTO Houses (Name, Townland) VALUES (@name, @town)";

            com.Parameters.Add("@name", SqlDbType.NVarChar).SqlValue = nameValue;
            com.Parameters.Add("@town", SqlDbType.NVarChar).SqlValue = townValue;

            com.Connection = con;

            con.Open();

            com.ExecuteNonQuery();

            con.Close();
        }

This works perfectly the only thing is how would I write the code to insert the other fields of text for example this line

var regex = new Regex(@"(Name of House\: )(.+?)[\r\n]");

Inserts the name of the house in this case "Aasleagh Lodge" but how would I write this line to insert the townland?

I tried replacing "Townland" in the regex with the field name I require but I end up with singular records each only holding one different column value.

Is there a way I could insert the data at the same time maybe by using a list or something so this would not occur.

New Source Code

var wordApp = new Microsoft.Office.Interop.Word.Application();
            var wordDoc = wordApp.Documents.Open(@"C:\Users\mhoban\Documents\Book.docx");
            var txt = wordDoc.Content.Text;

            using (var sr = new StringReader(txt))
            {
                var s = string.Empty;
                var nameValue = new StringBuilder();
                var townValue = new StringBuilder();
                while ((s = sr.ReadLine()) != null)
                {
                    if (s.StartsWith("Name of House"))
                    {
                        nameValue.Append(s.Split(new[] { ':' })[1].Trim());
                    }
                    else if (s.StartsWith("Townland"))
                    {
                        townValue.Append(s.Split(new[] { ':' })[1].Trim());
                    }

                    if (nameValue.Length > 0 && townValue.Length > 0)
                    {
                        SqlConnection con = new SqlConnection(ConfigurationManager.ConnectionStrings["ConnectionString"].ToString());
                        SqlCommand com = new SqlCommand();

                        com.CommandText = "INSERT INTO Houses (Name, Townland) VALUES (@name, @town)";
                        com.CommandText = "INSERT INTO Houses (Name) VALUES (@name)";

                        com.Parameters.Add("@name", SqlDbType.NVarChar).SqlValue = nameValue;
                        com.Parameters.Add("@town", SqlDbType.NVarChar).SqlValue = townValue;

                        com.Connection = con;

                        con.Open();

                        com.ExecuteNonQuery();

                        con.Close();

                        nameValue.Clear(); townValue.Clear();
                    }
                }
            }

Database Fields

[Id]          NCHAR (10)     NULL,
[Name]        NVARCHAR (MAX) NULL,
[Townland]    NVARCHAR (MAX) NULL,
[Near]        NVARCHAR (MAX) NULL,
[Status]      NVARCHAR (MAX) NULL,
[Built]       NVARCHAR (MAX) NULL,
[Description] NVARCHAR (MAX) NULL,
[Families]    NVARCHAR (MAX) NULL,
[Images]      IMAGE          NULL
  • I don't think it can be done with regex. You would need to use SingleLine option to get all the rows which removes the returns. The problem is your values have spaces between words which need to return as the terminator. I can easily do it without regex. – jdweng Apr 29 '15 at 09:28
  • @jdweng could you give me an example please? –  Apr 29 '15 at 09:40

4 Answers4

1

Here is a solution without regex. You really do not need it here.

var txt = "Name of House: Aasleagh Lodge\r\nTownland: Srahatloe\r\nNear: Killary Harbour, Leenane\r\nStatus/Public Access: maintained, private fishing lodge\r\nDate Built: 1838-1850, burnt 1923, rebuilt 1928\r\nName of House: House of Lan\r\nTownland: Another town land\r\nNear: Killary Harbour, Leenane\r\nStatus/Public Access: maintained, private fishing lodge\r\nDate Built: 1838-1850, burnt 1923, rebuilt 1928\r\nName of House: New Lodge\r\nTownland: NewTownLand\r\nNear: Killary Harbour, Leenane\r\nStatus/Public Access: maintained, private fishing lodge\r\nDate Built: 1838-1850, burnt 1923, rebuilt 1928";
using (var sr = new StringReader(txt))
{
   var s = string.Empty;
   var nameOfHouse = new StringBuilder();
   var townland = new StringBuilder();
   while ((s = sr.ReadLine()) != null)
   {
      if (s.StartsWith("Name of House"))
      {
          nameOfHouse.Append(s.Split(new[] {':'})[1].Trim());
      }
      else if (s.StartsWith("Townland"))
      {
           townland.Append(s.Split(new[] { ':' })[1].Trim());
      }

      if (nameOfHouse.Length > 0 && townland.Length > 0)
      { 
          // INSERT THE VALUES AND RESET THEM
          nameOfHouse.Clear(); townland.Clear();
      }
   }
}
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • I get a runtime on this line var nameValue = splts["Name of House"]; it says key does not exist in the dictionary –  Apr 29 '15 at 10:21
  • It is strange. Are you sure you posted input data as they appear in the `txt` variable? I posted the data I tested with. – Wiktor Stribiżew Apr 29 '15 at 10:22
  • Please see my screen. Please also post a picture of the `txt` variable in your question. – Wiktor Stribiżew Apr 29 '15 at 10:32
  • this is coming from a word document which contains 70000+ words the word doc example I posted is for one house only there is 570 others with the same format... I cant physically paste that much text into a variable –  Apr 29 '15 at 10:36
  • Ok, I have edited my answer where I am assuming you will have `Name of House` and `Townland` in each of the records. Otherwise, I need to know what is the record delimiter and if the number of entries in each record is the same, and if the entries may follow in different order. – Wiktor Stribiżew Apr 29 '15 at 11:19
  • @ProDiablo: The connection initialization and finalization must be performed outside the `while` loop. – Wiktor Stribiżew Apr 29 '15 at 13:35
  • Failed to convert parameter value from a StringBuilder to a String. On com.ExecuteNonQuery(); –  Apr 29 '15 at 13:39
  • So, you need to convert them into a string :( `com.Parameters.Add("@name", SqlDbType.NVarChar).SqlValue = nameValue.ToString();` and `com.Parameters.Add("@town", SqlDbType.NVarChar).SqlValue = townValue.ToString();` – Wiktor Stribiżew Apr 29 '15 at 13:41
  • its inputting an empty string into the database just in the house name field only –  Apr 29 '15 at 13:43
  • Please check https://msdn.microsoft.com/en-us/library/yy6y35y8%28v=vs.110%29.aspx. And here is another hint: http://stackoverflow.com/a/8390345/3832970. – Wiktor Stribiżew Apr 29 '15 at 13:59
0

You can use the following regex:

(.*?\: )(.+?)[\r\n]

See DEMO

Also.. you could replace [\r\n] in your regex with $ for matching last like where \r or \n are optional.

i.e:

var regex = new Regex(@"^(.*?\: )(.+?)$");
karthik manchala
  • 13,492
  • 1
  • 31
  • 55
0

Yes, this is possible. However, just making the regex generic only solves half your problem, because you also have to know what database column each value maps to.

Here is the general approach I would take:

  1. Have something that defines each parameter name you might encounter in the file and its corresponding column in the database. This could be just a dictionary in your code, but a more mature design would involve putting it in some kind of external configuration.

  2. Use a simple String.Split based on : for each line to put all of the key/value pairs into a dictionary (a regex is overkill here).

  3. Build the insert statement based on the data in the first two steps above.

  • could you give me an example of this, I haven't work with a dictionary or string.split before and every other way I have tried will wont work regex has been the closest. –  May 05 '15 at 09:01
0

Try this

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Data;
using System.Data.SqlClient;

namespace ConsoleApplication21
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.txt";
        static void Main(string[] args)
        {
            StreamReader reader = new StreamReader(FILENAME);
            string inputLine = "";
            List<TakenBMI> takenBMIs = new List<TakenBMI>();
            TakenBMI newTakenBMI = null;
            while ((inputLine = reader.ReadLine()) != null)
            {
                inputLine = inputLine.Trim();
                if (inputLine.Length > 0)
                {
                    string[] inputArray = inputLine.Split(new char[] { ':' });
                    switch (inputArray[0].Trim())
                    {
                        case "Name of House":
                            newTakenBMI = new TakenBMI();
                            takenBMIs.Add(newTakenBMI);
                            newTakenBMI.Name_of_House = inputArray[1].Trim();
                            break;
                        case "Townland":
                            newTakenBMI.Townland = inputArray[1].Trim();
                            break;
                        case "Near":
                            newTakenBMI.Near = inputArray[1].Trim();
                            break;
                        case "Status/Public Access":
                            newTakenBMI.Status_Public_Access = inputArray[1].Trim();
                            break;
                        case "Date Built":
                            newTakenBMI.Date_Built = inputArray[1].Trim();
                            break;
                    }
                }

            }

            reader.Close();

            SqlConnection con = new SqlConnection(ConfigurationManager.ConnectionStrings["ConnectionString"].ToString());
            con.Open();
            string SQL = "INSERT INTO Houses (Name, Townland, Near, Status, Built)" +
                "VALUES ('@name', '@town', '@near', '@status', '@built')";

            SqlCommand com = new SqlCommand(SQL,con);
           

            com.Parameters.Add("@name", SqlDbType.NVarChar);
            com.Parameters.Add("@town", SqlDbType.NVarChar);
            com.Parameters.Add("@near", SqlDbType.NVarChar);
            com.Parameters.Add("@status", SqlDbType.NVarChar);
            com.Parameters.Add("@built", SqlDbType.NVarChar);
 
            foreach (TakenBMI takenBMI in takenBMIs)
            {
                com.Parameters["@name"].Value = takenBMI.Name_of_House ;
                com.Parameters["@town"].Value = takenBMI.Townland;
                com.Parameters["@near"].Value = takenBMI.Near;
                com.Parameters["@status"].Value = takenBMI.Status_Public_Access;
                com.Parameters["@built"].Value = takenBMI.Date_Built;
               
                com.ExecuteNonQuery();
            }

        }
    }
    public class TakenBMI
    {
        public string Name_of_House { get; set; }
        public string Townland { get; set; }
        public string Near { get; set; }
        public string Status_Public_Access { get; set; }
        public string Date_Built { get; set; }
    }
}
jdweng
  • 33,250
  • 2
  • 15
  • 20
  • what variables is the data that is been retrieved held in? –  Apr 30 '15 at 08:33
  • List takenBMIs = new List(); – jdweng Apr 30 '15 at 09:19
  • How would I go about inserting this data into the database using the way I have stated in the above question. –  Apr 30 '15 at 13:46
  • What are the fields in the database? – jdweng Apr 30 '15 at 14:37
  • Updated question with fields –  Apr 30 '15 at 14:40
  • Thanks. I wanted to know if you need to put all the TakenBMI in one row of database or multiple rows. I would use the code I posted as is and then after getting all the data store results from the List and store in to database by enumerating through the List<>. You will need to modify the Insert SQL to include 9 parameters. Your orignal code with modified command text should work – jdweng Apr 30 '15 at 15:28
  • could you give me an example on how to do this as I am still learning –  May 01 '15 at 08:24
  • Updated answer with additional fields – jdweng May 01 '15 at 09:23
  • this method is not inserting any data into the database, the fields remain NULL –  May 05 '15 at 08:46
  • Need more info. Usually problems like yours occur because data is going into the wrong database or wrong table in dataqbase. Make sure you are getting into the foreach loop. Verify the connection string is correct. – jdweng May 05 '15 at 09:18
  • The data originally inserts using regex so the connection to the database is working. I used your answer and I get no errors but when I check was the data inserted, all the fields in the table are null and when I debug the code its skipping over the switch statement. –  May 05 '15 at 10:40
  • I'm confused. The switch statement worked previously. I didn't change that portion of the code. Just added new code after switch statement. I suspect the text file changed. – jdweng May 05 '15 at 11:08
  • Its a word doc and the file has not been edited since January of this year –  May 05 '15 at 11:11
  • How can a word document have a txt extension? Open txt file and make sure it is the same as posted file. – jdweng May 05 '15 at 11:18
  • where did you get text extension from? this is my path "C:\Users\mhoban\Documents\Book.docx" –  May 05 '15 at 11:22
  • You said you were extracting text from word document so I simulated the text by using a txt file. Did my code ever run for you? I'm wondering if there isn't invisible tags in the text that is stopping code from running. – jdweng May 05 '15 at 12:30
  • I need a way of extracting the text and at the same time inserting the extracted data into the database, the way it is now it skips over the switch statement but if I copy some of the text from the word document and save it as a txt file it then runs but I get an error on the Townland part of the switch "Additional information: Object reference not set to an instance of an object." –  May 05 '15 at 13:14
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/76992/discussion-between-prodiablo-and-jdweng). –  May 05 '15 at 13:18
  • Code requires "Name Of House" to be the start of each record in the file. Each time "Name of House" is found in th efile a new TakenBMI instance is created and added into the takenBMIs list. – jdweng May 06 '15 at 09:55
  • okay I copied some text from the word document and pasted it into note pad and saved it as a txt file to test and it works but this will take too much time because the word doc contains 70000+ words and it would take to much time to copy and paste everything so is there a way to extract and insert from the word document instead of using a txt file or is there a way of extracting the word document data and saving the data as a txt file? –  May 06 '15 at 10:28
  • Need to clear formatting. See following webpage : http://stackoverflow.com/questions/24480564/how-to-remove-html-tags-from-word-content – jdweng May 06 '15 at 10:59