0

I had two big text files to compare. Each file contains about 100,000 lines, and each line represents a single entity in the DB along with its data.

I'm using C#. To compare them initially, I just split each file into lines, then split each line into a dictionary, and then compared the values by key across the two files. This worked fine, but it looked a little awkward to me, since each line is 'stupid' and I have less control over what each split represents, aliasing, etc. Then I decided to represent each line as an object, with names, properties, etc. Since then the code is cleaner and easier to control, but performance-wise it takes about 8 minutes, compared to less than a minute with the previous approach.

I wanted to know whether creating objects out of every line is the right way (programming-wise), or whether in cases like this 'stupid' splitting, looping and comparing text is the 'cleaner' way?

Update on the purpose: I changed my code to turn the lines into objects because each split in a line has its own 'settings'. For example, one split will be an amount that looks like 00100, which I want to parse to an int and only then compare; some splits are 'to ignore'; each split also has a name (base amount, company, etc.), so I want to report the name of the split if there's a difference. My doubt is whether changing code that runs in about 20 seconds into code that runs in 10 minutes, but makes my life easier, is the right thing to do.

Roni Axelrad
  • What are you trying to accomplish? Looking for duplicated lines? Lines with the same key but different values? New records to add from one file to the other? – Joel Coehoorn Sep 28 '17 at 13:45
  • That completely depends on what you're doing with the lines afterwards. If you're just scanning the file for errors, then stay with reading it line by line and processing the strings. If you want to actually do something with the data, then you should probably create some instances of a class. – MakePeaceGreatAgain Sep 28 '17 at 13:47
  • I'm trying to take each line, find the key, and compare the whole line to the referenced line of the other file. Each line represents an 'entity' and I'm trying to compare the content of the matching entities from both files. – Roni Axelrad Sep 28 '17 at 13:51
  • @JoelCoehoorn, lines with the same key but different values. But to check the different values, I have to split the line itself. Each split might represent different 'data' that needs to be treated differently while comparing. – Roni Axelrad Sep 28 '17 at 14:01
  • Will lines from the first file always have a match in the second? Are the files sorted in any way? – Joel Coehoorn Sep 28 '17 at 14:04
  • @JoelCoehoorn 99% of them will have a match, but one of the issues I check is whether a match is missing; there are about 100 of those out of the 100,000. The files are not sorted at all. – Roni Axelrad Sep 28 '17 at 14:26
  • Sounds like a [diff](https://en.wikipedia.org/wiki/Diff_utility) tool to me. Do you also want to determine moved/added/deleted parts? See [this](https://stackoverflow.com/q/138331/1997232). – Sinatr Sep 28 '17 at 14:34
  • @Sinatr Thanks, but I couldn't find diff tools that, instead of just comparing line to line, know how to map the lines by a key and then compare the right lines. Also, splitting a line into separate sections, each with a name and other attributes, is not possible with them. – Roni Axelrad Sep 28 '17 at 16:42

2 Answers

-1

A simple principle to follow is the idea that you have to build what you need, not what looks pretty.

If you need to manipulate the data you're reading, then using the data to populate objects makes sense to me. But if you only need to compare entries, then there is no reason to make the entries any 'smarter' than that.

I would, however, recommend that if you think the code is ugly or feels clunky, you look for alternatives and pick what you think is best. For comparing especially, you might want to look more deeply into LINQ or regex, for example.
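One possible middle ground, sketched below, keeps the cheap split-and-compare loop but describes each column once with a small spec (name, ignore flag, normalizer), so differences can still be reported by name. Everything here is an assumption for illustration: the `FieldSpec` and `LineComparer` names, the `|` separator, and the key living in column 0 are not from the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical description of one column: the name used in reports, whether
// to skip it, and how to normalize the raw text before comparing.
public sealed class FieldSpec
{
    public string Name { get; }
    public bool Ignore { get; }
    public Func<string, string> Normalize { get; }

    public FieldSpec(string name, bool ignore = false, Func<string, string> normalize = null)
    {
        Name = name;
        Ignore = ignore;
        Normalize = normalize ?? (s => s); // default: compare the raw text
    }
}

public static class LineComparer
{
    // Assumed layout: separator-delimited columns with the key in column 0.
    // If a key occurs twice, the last line wins.
    public static Dictionary<string, string[]> Index(IEnumerable<string> lines, char separator = '|')
    {
        var index = new Dictionary<string, string[]>();
        foreach (var line in lines)
        {
            var parts = line.Split(separator);
            index[parts[0]] = parts;
        }
        return index;
    }

    // Reports keys missing from the second file and named columns that differ.
    public static List<string> Compare(
        Dictionary<string, string[]> left,
        Dictionary<string, string[]> right,
        IReadOnlyList<FieldSpec> specs)
    {
        var report = new List<string>();
        foreach (var pair in left)
        {
            if (!right.TryGetValue(pair.Key, out var other))
            {
                report.Add($"{pair.Key}: missing in second file");
                continue;
            }
            for (int i = 0; i < specs.Count; i++)
            {
                var spec = specs[i];
                if (spec.Ignore) continue;
                // A column absent from a short line is treated as empty.
                var a = i < pair.Value.Length ? pair.Value[i] : "";
                var b = i < other.Length ? other[i] : "";
                if (spec.Normalize(a) != spec.Normalize(b))
                    report.Add($"{pair.Key}: {spec.Name} differs ('{a}' vs '{b}')");
            }
        }
        return report;
    }
}
```

An amount column such as 00100 could then be declared as `new FieldSpec("base amount", normalize: s => int.Parse(s).ToString())`, and a column to skip as `new FieldSpec("timestamp", ignore: true)`. Note that this sketch only walks the first file, so keys that exist only in the second file would need a second pass.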

SpiritBH
  • The thing is that I changed my code to turn the lines into objects because each split in a line has its own 'settings'. For example, one split will be an amount that looks like 00100, which I want to parse to an int and only then compare; some splits are 'to ignore'; each split also has a name (base amount, company, etc.), so I want to report the name of the split if there's a difference. – Roni Axelrad Sep 28 '17 at 13:54
  • Based on this, I would say that converting to objects is helping. Maybe you could improve performance by converting only the lines that have differences. – ogomrub Sep 28 '17 at 13:58
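The idea in the last comment, converting only the lines that differ, might look roughly like the sketch below: compare the raw text of matching lines first and build objects only when the text is not identical. The `Entity` class, its three properties, the `|` separator, and the key-in-first-column layout are all hypothetical stand-ins for whatever the real lines contain.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical entity; the real one would carry the line's actual properties.
public sealed class Entity
{
    public string Key { get; }
    public string Company { get; }
    public int BaseAmount { get; }

    public Entity(string key, string company, int baseAmount)
    {
        Key = key;
        Company = company;
        BaseAmount = baseAmount;
    }

    // Assumed layout: key|company|amount. Throws on malformed input.
    public static Entity Parse(string line, char separator = '|')
    {
        var parts = line.Split(separator);
        return new Entity(parts[0], parts[1], int.Parse(parts[2]));
    }
}

public static class RawLineComparer
{
    // Maps each key (text before the first separator) to its raw line.
    public static Dictionary<string, string> IndexRaw(IEnumerable<string> lines, char separator = '|')
    {
        var index = new Dictionary<string, string>();
        foreach (var line in lines)
        {
            int cut = line.IndexOf(separator);
            index[cut < 0 ? line : line.Substring(0, cut)] = line;
        }
        return index;
    }

    public static List<string> Compare(
        Dictionary<string, string> left,
        Dictionary<string, string> right)
    {
        var report = new List<string>();
        foreach (var pair in left)
        {
            // Missing matches are not reported in this sketch.
            if (!right.TryGetValue(pair.Key, out var otherLine)) continue;

            // Cheap check first: identical text needs no parsing at all.
            if (string.Equals(pair.Value, otherLine, StringComparison.Ordinal)) continue;

            // Only lines whose text differs pay the cost of becoming objects.
            var a = Entity.Parse(pair.Value);
            var b = Entity.Parse(otherLine);
            if (a.Company != b.Company)
                report.Add($"{pair.Key}: company differs");
            if (a.BaseAmount != b.BaseAmount)
                report.Add($"{pair.Key}: base amount differs");
        }
        return report;
    }
}
```

Whether this helps depends on how many lines are textually identical; if most matching lines differ only in ignored or unnormalized columns, nearly every pair would still be parsed.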
-1

In my opinion, maintainability and readability are more important. Once you have achieved those, you can always improve the performance.

I have seen many "early" performance optimizations that were not needed at all and just made everything more complicated.
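When the time does come to improve performance, it usually helps to measure which stage is actually slow before changing anything. A minimal sketch using `System.Diagnostics.Stopwatch` follows; the `Timing` helper is a made-up name for illustration.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        var watch = Stopwatch.StartNew();
        action();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Wrapping the parsing step and the comparing step in separate `Timing.Measure(() => ...)` calls would show whether the extra minutes come from creating the objects or from how matching lines are looked up.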

ogomrub