1

I have been working a long time on a project using MS Access 2013. One issue i am having is i have very very long "Comments" in tables i need to break down and insert into new tables. Each comment is linked to a "RouteID" and their relationships between the two can be many to many. The main issue i am having is there are duplicate comments in the table i am moving data FROM. There is no need to keep the duplicate "Comments", the only difference in the rows is the "RouteID". Basically i have an OLD comments table and a NEW comments table.

My issue is its not correctly checking if comments from my OLD table are in the NEW table and is creating duplicates.

SOME COMMENTS ARE found to be duplicates, others are not and the size of the comments that are found to NOT BE DUPLICATES vary on size and symbols from short to very long.

Here is some code i have written, i have attempted multiple versions of SQL and VBA/VB6 code, however the result is still the same, duplicate comments are showing up in my new table. Please feel free to critique this regardless if it has to do with my issue or not.

I am aware that some queries can be far far too long to work so i have made a SQL query to compare the TABLE'S together however that also fails and duplicate comments remain. I have checked my code and i do not believe that i am doing the logic incorrectly

Please help! No one seems to know what to do in my circle of friends / professors. I have an idea to take the comments and HASH them and put them into a similar table and use that to check

If Not (rsOLD.EOF And rsOLD.BOF) Then
    rsOLD.MoveFirst
    Do Until (rsOLD.EOF = True)
        TComment = rsOLD(CommentColumn)
        TResponse = rsOLD(ResponseColumn)
        If Not IsNull(TComment) Then
            TComment = Replace(TComment, "'", "''")
            SQL = "SELECT Comment, ID FROM Comments WHERE Comment = (SELECT '" & CommentColumn & _
                  "' FROM CommentsOld WHERE (CommentsOld.ID = " & rsOLD!ID & "));"
            'SQL = "SELECT Comment FROM Comments" & _
            '      " INNER JOIN CommentsOld" & _
            '      " ON Comments.Comment = CommentsOld." & CommentColumn & _
            '      " WHERE CommentsOld.ID = " & rsOLD!ID & ";"
            Set rsCHECK = CurrentDb.OpenRecordset(SQL, dbOpenDynaset)
            If (rsCHECK.EOF And rsCHECK.BOF) Then 'IF COMMENT DOES NOT EXIST, NOTHING FOUND

I have attempted to work with a bool function that loops through a recordset, but the BigO of the loops is far to large to complete in a reasonable amount of time given the size of the records in each table.

JaredTS486
  • 436
  • 5
  • 16
  • 3
    please post only relative code and narrow the actual problem with your code –  Jan 23 '14 at 16:18
  • Are you saying that the code is failing to identify comments that really are duplicates but are not exact string matches (e.g., because of trailing whitespace, or imbedded line breaks, or similar), or is the issue more subtle than that? – Gord Thompson Jan 23 '14 at 17:42
  • I trimmed down the code some, the main check is at the bottom. Commented SQL code is something i tried but was unable to complete because Inner Join between the two caused an error due to the two being different types (from what i read) between the two tables however they are both identical. As for duplicates for instance, one that is inserted into the new table many times is "[Forest Service comment]" and i do not see any extra / trailing WS or newlines. – JaredTS486 Jan 23 '14 at 18:04
  • Basically rsCHECK.EOF And rsCHECK.BOF are always True. I also have used the DLookup() and checked it for VBnullString and also a length greater than zero and got the same results. – JaredTS486 Jan 23 '14 at 18:24

2 Answers2

1

One possible cause would be that your code is doing

SQL = "SELECT Comment, ID FROM Comments WHERE Comment = (SELECT '" & CommentColumn & _
        "' FROM CommentsOld WHERE (CommentsOld.ID = " & rsOLD!ID & "));"

so instead of returning the contents of the column whose name is in the variable CommentColumn you are returning the column name as a literal string. That is, if CommentColumn contains "Column1" then your SQL code is not doing

... (Select Column1 FROM CommentsOld ...

it is doing

... (Select 'Column1' FROM CommentsOld ...

Perhaps you should try

SQL = "SELECT Comment, ID FROM Comments WHERE Comment = (SELECT [" & CommentColumn & _
        "] FROM CommentsOld WHERE (CommentsOld.ID = " & rsOLD!ID & "));"

Edit re: comment

Since there are some significant restrictions on Memo (Long Text) fields compared with Text (Short Text) fields w.r.t. joins, DISTINCT queries (as discussed in another answer), etc. your hashing idea is starting to look increasingly appealing. There are links to some VBA/VB6 implementations of various hashing algorithms in an answer here.

Generating a hash for every comment could potentially be rather time-consuming so you'll probably only want to do it once. If you could add a [..._hash] column for each comment column (e.g., add a Short Text column named [CP1_hash] for the Long Text column [CP1]) and store the hashes in there, that would be ideal. Once the hashing was done you could compare comment hashes instead of the comments themselves. In addition, the hash columns could be joined, fully indexed, and manipulated in other useful ways.

(Yes, there would be the remote chance of a hash collision, but I think it would be extremely unlikely given the length of the strings you are likely to be processing.)

One thing you would certainly not want to do is use the hashing function itself in a WHERE clause or a JOIN condition. That would cause a table scan and force the re-calculation of all the hash values for every row, and that could really slow things down.

Community
  • 1
  • 1
Gord Thompson
  • 116,920
  • 32
  • 215
  • 418
  • I get a error when using the brackets "Run-time error '3342': Invalid Memo, OLE, or Hyperlink Object in subquery '[CP1]'" (which is the name of the column, stands for Comment Period 1) i remember this now, something i tried a while back. Similar to the error i was getting when trying to use the Inner Join SQL query. – JaredTS486 Jan 23 '14 at 18:48
  • Thanks, i think that would be best, i have a hashing function set up, i hope that it can take the comments without issues. Not to mention the horrible table set up for the comments [Image](https://dl.dropboxusercontent.com/u/25728887/Public%20Linked%20Images/2014-01-23_123950.png). – JaredTS486 Jan 23 '14 at 19:42
  • I am using `Replace(string, "'", "''")` to add an escape character for the Comment before inserting it into the database. I can not decide on when i should take the Hash of the comment, before or after the escape character replacement. Any suggestions? I believe it does not matter in end... – JaredTS486 Jan 23 '14 at 20:21
  • @JaredTS486 Are you doubling up the single-quotes in preparation for "gluing together" a SQL statement that uses single quotes as string delimiters? If so, then when the text is written to the database the `''`s will be converted back to `'`s. In other words, you're not altering the comment text, you're just tweaking it so the SQL statement won't break. (But then again you know that "gluing together" SQL statements is a `Bad Idea`, right...?) – Gord Thompson Jan 23 '14 at 20:35
  • Yes the method i am using is using MS Access DAO recordsets to add the data. Such as `rs!Item = NewItem`. I looked into the param version for VB6 but i had no luck getting it to work. If you have any examples that would be great! Also my code is running through making the Hashes now, its taking it a while and i am getting a little nervous about the time this will end up taking. – JaredTS486 Jan 23 '14 at 20:47
  • @JaredTS486 If you are adding records by using a DAO Recordset then what is the reasoning behind doubling-up the `'` quotes? (By the way, Recordset inserts/updates are very similar to parameterized queries in that they are not vulnerable to SQL Injection, etc., and that is a `Good Thing`.) – Gord Thompson Jan 23 '14 at 20:54
  • 1
    Hashing the comments worked just fine :) thanks for the help! – JaredTS486 Jan 24 '14 at 18:02
0

Start with an empty table, NewTable. Then run this query:

insert into NewTable (Comment)
SELECT distinct Comment
FROM OldTable;

the 'distinct' will exclude all duplicates, so what ends up in NewTable should be unique. You can then go through OldTable and do what you like with each of the RouteIDs.

simon at rcl
  • 7,326
  • 1
  • 17
  • 24
  • 1
    In Access, SELECT DISTINCT and `Memo` (`Long Text`) fields do not play nicely together. In Access 2010 your statement fails with "The field is too small to accept the amount of data you attempted to add. Try inserting or pasting less data." (The statement does work properly if the `DISTINCT` keyword is removed.) – Gord Thompson Jan 23 '14 at 17:40
  • Hmm. I did test it in Access 2013 with a long text field (but both tables were empty). Although it didn't do anything it ran fine. Does ORDER BY have any problems with LongText? – simon at rcl Jan 23 '14 at 17:44
  • I am seeing duplicates in my new table that are fairly simple such as "[Forest Service comment]" and that is it. Perhaps i need to do some sanitization first? – JaredTS486 Jan 23 '14 at 18:06
  • Interesting, in Access 2013 the statement did not cause an error like it does in Access 2010, but it did truncate the values in [NewTable] to 255 characters. (That's a very common side-effect of SELECT DISTINCT on Memo fields in Access.) – Gord Thompson Jan 23 '14 at 18:10
  • If distinct doesn't work, then you could select them all, with an ORDER BY Comment. Then in code you check each comment doesn't = the previous comment; if it does its a dup and can be ignored; if it doesn't its a new one and gets inserted into the new table. Another option is to do select comment, count(*) from newtable group by comment having count(*) > 1 which will give you only dups; you can then process them by hand if there are a reasonable number or with vba if there are many. – simon at rcl Jan 23 '14 at 18:11
  • So i tried using Set rsOLD = CurrentDb.OpenRecordset("SELECT * FROM CommentsOld ORDER BY " & CommentColumn & ";", dbOpenDynaset) And i still get the same result, it seems to fail finding the duplicates in this case also, i have not tried the SELECT DISTINCT yet. – JaredTS486 Jan 23 '14 at 18:57
  • @GordThompson I attempted a GROUP BY SQL query here [Link](http://pastie.org/8661312#5) Oddly the query was returning NULL strings, yet the record sets EOF was true and BOF was false. In the end the new comment table was empty after i ran that code. – JaredTS486 Jan 23 '14 at 19:31