0

I'm looking to compare two big sets of csv files and/or a csv file and a .txt file. I "think" the .txt file may need to be converted to a csv file just for simplicity sake but that may or may not be needed. I either want to use excel, c++, or python. I need to compare one "accepted" value list to a list that is measured and find the difference between them if there is one. Excel may be the easiest way to do this but python or c++ may work just as well. This is not homework so don't worry about that sort of thing. Code advice and/or templates is greatly appreciated. or links to websites

EDIT 1

I've read about Python's difflib or differ class but unfamiliar how to use it and may be more than I want.

EDIT 2

The Files both will have a series of columns(not with lines drawn between them or anything) and below those "named" columns there will be numbers. I need to compare the number in column 1 spot one in file one to column 1 spot one of file 2 and if there is a difference show the difference in another csv file

animuson
  • 53,861
  • 28
  • 137
  • 147
  • I think I would convert everything to CSV files, (something like this http://pypi.python.org/pypi/xlrd/0.5.2 would help) and compare the CSVs. Using CSV as the common denominator thats application and platform agnostic – tMC May 31 '11 at 18:18
  • It's a bit tricky to answer without more info - can you explain how the data is laid out in the files? – Thomas K May 31 '11 at 18:32
  • a heading for each of 5 columns under each column there are numbers and I need to compare each of the numbers under the 5 columns to numbers in the same spot on a different file and find a difference if there is one –  May 31 '11 at 18:35
  • "*I either want to use excel, c++, or python*" Well gee, that sure narrows it down. Pick one. – ildjarn May 31 '11 at 19:34
  • @ildjarn if you don't have anything to add to the answering of my question stay away please and I did that to bring in people from other programming backgrounds. You forgot to complain about me tagging python, c++, and excel as well –  May 31 '11 at 21:26
  • @Tyler31 : Actually that complaint was implied. ;-] Maybe you should be more receptive to criticism given the fact that you're not getting any high-quality answers. – ildjarn May 31 '11 at 21:32
  • ^^ what a waste of time. and arrogant –  Jun 01 '11 at 02:34
  • plain `awk` (or even `tr`) and `diff` would solve this problem with a (short) one liner in any unix-like environment. There are unix-like environments on windows. – Alexandre C. Jun 01 '11 at 08:49

2 Answers2

2

You can use ADO (ODBC/JET/OLEDB Text Driver) to treat 'decent' .txt/.csv/.tab/.flr files as tables in a SQL Database from every COM-enabled language. Then the comparisons could be done using the power of SQL (DISTINCT, GROUP, (LEFT) JOINS, ...).

Added with regard to your comment:

It's your problem and I don't want to push you where you don't want to go. But SQL is a good (the best?) tool, if you need to compare tabular data. As evidence the output of a script that spots the differences in two .txt files:

======= The .txt files to play with
------- file1.txt
"AC";"AM"
40000;-19083,00
40100;20000,00
40200;350004,00
40300;3498,99

------- file2.txt
"AC";"AM"
40000;-19083,00
40300;3498,99
40105;-234567,00
40200;350,00

======= Some diagnostic SQL
------- <NULL> indicates: In F1 but not in F2 (LEFT JOIN)
SELECT T1.AC, T1.AM, T2.AM FROM [file1.txt] AS T1 LEFT JOIN [file2.txt] AS T2 ON (T1.AC =
T2.AC)
------- Result
AC      File1   File2
40000   -19083  -19083
40100   20000   <NULL>
40200   350004  350
40300   3498,99 3498,99

------- <NULL> indicates: Not in the other file (LEFT JOIN, UNION)
SELECT T1.AC, T1.AM, T2.AM FROM [file1.txt] AS T1 LEFT JOIN [file2.txt] AS T2 ON (T1.AC =
T2.AC) UNION SELECT T2.AC, T1.AM, T2.AM FROM [file2.txt] AS T2 LEFT JOIN [file1.txt] AS T1
 ON (T1.AC = T2.AC)
------- Result
AC      File1   File2
40000   -19083  -19083
40100   20000   <NULL>
40105   <NULL>  -234567
40200   350004  350
40300   3498,99 3498,99

------- the problems: missing, different values
SELECT T1.AC, T1.AM, T2.AM FROM [file1.txt] AS T1 LEFT JOIN [file2.txt] AS T2 ON (T1.AC =
T2.AC) WHERE T2.AM IS NULL OR T1.AM <> T2.AM UNION SELECT T2.AC, T1.AM, T2.AM FROM [file2.
txt] AS T2 LEFT JOIN [file1.txt] AS T1 ON (T1.AC = T2.AC) WHERE T1.AM IS NULL OR T1.AM <>
T2.AM
------- Result
AC      File1   File2
40100   20000   <NULL>
40105   <NULL>  -234567
40200   350004  350

Further additions:

This article deals with ADO and text files; look for a file adoNNN.chm (NNN=Version number, e.g. 210) on your computer; this is a good book about ADO.

You can use Access or OpenOffice Base to experiment with SQL statements applied to a linked/referenced (not imported!) text database.

A script/program will be easy after you mastered the initial hurdle: connecting to the the database, i.e. to a folder containing the files and a schema.ini file to define the structure of the files=tables.

The output above was generated by:

  Const adClipString = 2

  Dim oFS  : Set oFS = CreateObject( "Scripting.FileSystemObject" )
  Dim sDir : sDir    = oFS.GetAbsolutePathName( ".\txt" )
  Dim oDB  : Set oDb = CreateObject( "ADODB.Connection" )
  oDB.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & sDir & ";Extended Properties=""text"""
  Dim sSQL
  Dim sFiNa
  WScript.Echo "=======", "The .txt files to play with"
  For Each sFiNa In Array( "file1.txt", "file2.txt"  )
      WScript.Echo "-------", sFiNa
      WScript.Echo oFS.OpenTextFile( "txt\" & sFiNa ).ReadAll()
  Next

  WScript.Echo "=======", "Some diagnostic SQL"
  Dim aSQL
  For Each aSQL In Array( _
       Array(   "<NULL> indicates: In F1 but not in F2 (LEFT JOIN)" _
              , Join( Array( _
                     "SELECT T1.AC, T1.AM, T2.AM FROM" _
                   , "[file1.txt] AS T1" _
                   , "LEFT JOIN [file2.txt] AS T2 ON (T1.AC = T2.AC)" _
                ), " " ) ) _
     , Array(   "<NULL> indicates: Not in the other file (LEFT JOIN, UNION)" _
              , Join( Array( _
                     "SELECT T1.AC, T1.AM, T2.AM FROM" _
                   , "[file1.txt] AS T1" _
                   , "LEFT JOIN [file2.txt] AS T2 ON (T1.AC = T2.AC)" _
                   , "UNION" _
                   , "SELECT T2.AC, T1.AM, T2.AM FROM" _
                   , "[file2.txt] AS T2" _
                   , "LEFT JOIN [file1.txt] AS T1 ON (T1.AC = T2.AC)" _
                ), " " ) ) _
     , Array(   "the problems: missing, different value" _
              , Join( Array( _
                     "SELECT T1.AC, T1.AM, T2.AM FROM" _
                   , "[file1.txt] AS T1" _
                   , "LEFT JOIN [file2.txt] AS T2 ON (T1.AC = T2.AC)" _
                   , "WHERE T2.AM IS NULL OR T1.AM <> T2.AM" _
                   , "UNION" _
                   , "SELECT T2.AC, T1.AM, T2.AM FROM" _
                   , "[file2.txt] AS T2" _
                   , "LEFT JOIN [file1.txt] AS T1 ON (T1.AC = T2.AC)" _
                   , "WHERE T1.AM IS NULL OR T1.AM <> T2.AM" _
                ), " " ) ) _
     )
     sSQL = aSQL( 1 )
     WScript.Echo "-------", aSQL( 0 )
     WScript.Echo sSQL
     Dim oRS : Set oRS = oDB.Execute( sSQL )
     WScript.Echo "------- Result"
     WScript.Echo Join( Array( "AC", "File1", "File2" ), vbTab )
     WScript.Echo oRS.GetString( adClipString, , vbTab, vbCrLf, "<NULL>" )
  Next
  oDB.Close

If you delete/ignore the fat (create SQL statements, diagnostics output), it boils down to 6 lines

  Dim oDB  : Set oDb = CreateObject( "ADODB.Connection" )
  oDB.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & sDir & ";Extended Properties=""text"""
  sSQL = "..."
  Dim oRS : Set oRS = oDB.Execute( sSQL )
  WScript.Echo oRS.GetString( adClipString, , vbTab, vbCrLf, "<NULL>" )
  oDB.Close

which can be 'ported' easily to every COM-enabled language, because the ADO objects do all the heavy lifting. The .GetString method comes handy, when you want to save a resultset: just twiddle the separator/delimiter/Null arguments and dump it to file

  oFS.CreateTextFile( ... ).WriteLine oRS.GetString( _
    adClipString, , ",", vbCrLf, ""
  )

(don't forget to add a definition for that table to your schema.ini). Of course you also can use a "SELECT/INSERT INTO", but such statements may not be easy to get right/passed the ADO Text Driver's parser.

Addition wrt Computations:

Start with a 5 x 2 master/approved file containing:

Num0    Num1    Num2    Num3    Num4
7,6     6,1     3,8     0,9     8,9
0,9     9,4     4,7     8,8     9,9

transform it to expected.txt

Num0    Num1    Num2    Num3    Num4    Spot
7,6     6,1     3,8     0,9     8,9     1
0,9     9,4     4,7     8,8     9,9     2

by appending the Spot column so it conforms to

[expected.txt]
ColNameHeader=True
CharacterSet=1252
Format=Delimited(;)
Col1=Num0 Float
Col2=Num1 Float
Col3=Num2 Float
Col4=Num3 Float
Col5=Num4 Float
Col6=Spot Integer

in your schema.ini file. Similarly, transform a measure file like:

Num0    Num1    Num2    Num3    Num4
7,1     1,1     3,8     0,9     8,9
0,9     9,4     4,7     8,8     9,9

to measured.txt

Num0    Num1    Num2    Num3    Num4    Spot
7,1     1,1     3,8     0,9     8,9     1
0,9     9,4     4,7     8,8     9,9     2

Apply

  sSQL = Join( Array( _
         "SELECT E.Num0 - M.Num0 AS Dif0" _
      ,       ", E.Num1 - M.Num1 AS Dif1" _
      ,       ", E.Num2 - M.Num2 AS Dif2" _
      ,       ", E.Num3 - M.Num3 AS Dif3" _
      ,       ", E.Num4 - M.Num4 AS Dif4" _
      ,       ", E.Spot          AS Spot" _
      ,  "FROM [expected.txt] AS E" _
      ,  "INNER JOIN [measured.txt] AS M" _
      ,  "ON E.Spot = M.Spot" _
  ), " " )

Write the resultset to differences.txt

aFNames = Array( "Num0", ... "Spot" ) oFS.CreateTextFile( sFSpec ).Write _ Join( aFNames, sFSep ) & sRSep & oRS.GetString( adClipString, , sFSep, sRSep, "" )

and you get:

Num0    Num1    Num2    Num3    Num4    Spot
0,5     5       0       0       0       1
0       0       0       0       0       2
Ekkehard.Horner
  • 38,498
  • 2
  • 45
  • 96
  • I didn't ask about SQL sorry :s –  May 31 '11 at 19:10
  • Cool, that was a neat trick! What do you use as a diagnostic SQL console? – zvrba May 31 '11 at 20:46
  • That is interesting... So you think I can do this and output the difference in another file? Any website suggestions? or files you can send me to mess with with instructions on how to do that? Thanks for the info(sorry for the prior statement) :p –  Jun 01 '11 at 03:40
  • I understand somewhat what you are doing here by showing that there is "some" difference in the files but what I'm really after is I'm measuring a value and it's different than the given "true" value and I want to import a measured value's file and a true value's file and have it spit out the "remainder" so to speak into another file or something I can export or save as a csv file –  Jun 01 '11 at 13:28
1

You don't need to code, you can make separators the same in both files (spaces or commas) using replace function in text editor and compare it using graphical diff tool from TortoiseSVN: http://tortoisesvn.net/

dbf
  • 6,399
  • 2
  • 38
  • 65
  • How does TortoiseSVN work or do I need to just read the documentation?? –  May 31 '11 at 18:59
  • http://tortoisesvn.net/docs/release/TortoiseSVN_en/tsvn-dug-diff.html - compare two files – dbf May 31 '11 at 21:05