
I have a text file in comma-separated format. Each row has two columns, and both columns contain integer values, like:

12334,23433
23234,45663
234422,324545
324543,23433
143233,23433
.....

The values in the second column are repeated. What I need to accomplish is to find, for each second-column value, all the first-column values that share it, and to represent them in one row. For the above data:

23433 12334,324543,143233
45663 23234
324545 234422

What I have done so far is the following:

  1. Imported the text file into a SQL Server table, using the comma (,) as the delimiter.
  2. Read the text file from code, line by line.
  3. Split each line on the comma (,) and used the second-column value to query the SQL table.
  4. Stored the results in a dictionary, where the key is the second-column value and all the matching first-column values are concatenated.
  5. After all processing, traversed the dictionary and wrote it to a file.

This is definitely taking too much time. I have written the code in C#; any solution in T-SQL will work as well.

Any help to optimize it would be appreciated.
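
Roughly, steps 2–5 correspond to something like the sketch below (a simplified illustration only, not the original code; the per-row SQL query from step 3 is left out and the value already on the line is used instead, and inputFilename/outputFilename are placeholder names):

// Simplified sketch of steps 2-5 (illustrative only, not the original code).
// Requires: using System.Collections.Generic; using System.IO;
//           using System.Linq; using System.Text;
var groupsByCol2 = new Dictionary<string, StringBuilder>();

foreach (var line in File.ReadLines(inputFilename))        // step 2: read line by line
{
    var columns = line.Split(',');                          // step 3: split on the comma
    // The real step 3 queries SQL Server here for every row, which is the
    // expensive part; this sketch just uses columns[0] from the line itself.
    StringBuilder sb;
    if (!groupsByCol2.TryGetValue(columns[1], out sb))      // step 4: dictionary keyed on
        groupsByCol2[columns[1]] = sb = new StringBuilder();//         the second column
    if (sb.Length > 0) sb.Append(',');
    sb.Append(columns[0]);                                  // concatenate first-column values
}

// step 5: traverse the dictionary and write it to a file
File.WriteAllLines(outputFilename,
    groupsByCol2.Select(kv => kv.Key + " " + kv.Value));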

Malik
  • How about actually showing us the query you are using? – vcsjones Feb 10 '13 at 17:03
  • I don't know anything about this, but I would assume you could cut your checking time largely if you just check the first byte in each number, eliminate everything else, put it in a new table, then release the original table (free memory), then repeat the process with the second byte, and so on until you have done it for the number of bytes of your maximum number size? – Albert Renshaw Feb 10 '13 at 17:04
  • @leppie - Indexed first column Thanks for reply – Malik Feb 10 '13 at 17:09
  • @vcsjones - query is simple select statement with where clause equal to value for 2nd column. Thanks for reply. – Malik Feb 10 '13 at 17:10
  • @AlbertRenshaw - Any more details? As per my understanding, I thought of reading the file in chunks, but the main cost is that I have to query as many times as there are rows. Thanks for reply. – Malik Feb 10 '13 at 17:11
  • 10 million rows? That's not large data :P -- "Small Data is when is fit in RAM. Big Data is when is crash because is not fit in RAM." [@DEVOPS_BORAT](https://twitter.com/DEVOPS_BORAT/status/299176203691098112) – CodesInChaos Feb 10 '13 at 17:18
  • With proper C# code the most expensive part of this should be reading from the db and writing to the file. – CodesInChaos Feb 10 '13 at 17:20
  • @Malik Why create an index for the first column? You're grouping/sorting by the second column, so the second column needs the index. – CodesInChaos Feb 10 '13 at 17:43
  • @CodesInChaos - Oops. I thought indexing on the first column would fetch results quickly. Got your point. Thanks again for reply. – Malik Feb 10 '13 at 17:51

3 Answers


Doing this in pure C# should be fast and easy, as long as the data is small. No need to pull your CSV file into an SQL database.

As long as the file fits into your RAM, the biggest cost should be IO, not the processing. With 10 million lines the file should be about 100 MB and take perhaps a second to process.

// Lazily read the file line by line (File.ReadLines requires .NET 4.0).
var lines = File.ReadLines(inputFilename);

// Split each line into its two columns.
var table = lines.Select(line => line.Split(','));
// Group the rows by the second column.
var groups = table.GroupBy(columns => columns[1]);
// Build one output line per group: "<col2> <col1>,<col1>,...".
var output = groups.Select(g => g.Key + " " + string.Join(",", g.Select(columns => columns[0])));

File.WriteAllLines(outputFilename, output);
CodesInChaos
  • Thanks a lot for your reply. I have more than 10 million rows. Will above code take less time? – Malik Feb 10 '13 at 17:32
  • @Malik As long as it fits into your RAM this code should be fast. If it doesn't fit into RAM you'll either get an exception or your computer will start swapping. If you get an exception, switch to 64 bit. – CodesInChaos Feb 10 '13 at 17:36
  • Converting the strings to int before the `GroupBy` step might reduce the required memory a bit (see the sketch after these comments). – CodesInChaos Feb 10 '13 at 17:42
  • I might actually add .ToList() to the end of the `var table = lines.Select(...)` line. This might reduce the number of times String.Split is run... – Aron Feb 10 '13 at 20:11
  • @Aron Why? It splits once per line with the current code. Like all LINQ functions, `GroupBy` iterates only once over the input sequence. Adding `ToList()` will increase memory use without improving anything. – CodesInChaos Feb 10 '13 at 20:12
  • My worry is that the GroupBy might make multiple passes over the `table` enumerable. – Aron Feb 10 '13 at 20:13
  • However if memory is a worry, you could use this http://stackoverflow.com/a/286553/1808494 to stream each line from the file one at a time (but ToList() will deffo be a bad idea). Then even memory wouldn't be much of an issue... – Aron Feb 10 '13 at 20:16
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/24285/discussion-between-aron-and-codesinchaos) – Aron Feb 10 '13 at 20:17
  • @CodesInChaos - I am getting an error on the second-to-last line, string.Join(",", g.Select(columns=>columns[0]))); - invalid argument. Any help? – Malik Feb 11 '13 at 05:05
  • @Malik I ran that code on your example data, and it worked. If you're talking about an `ArgumentException`, then I don't really see where that'd come from. If there is a line that doesn't contain a `,` then I'd expect an `ArgumentOutOfRangeException` – CodesInChaos Feb 11 '13 at 09:27
  • @CodesInChaos - Sorry for the late reply. It seems that I am getting the error on File.ReadLines(..): "System.IO.File" does not contain a definition for 'ReadLines'. When I change it to 'ReadAllLines', I get the error at the second-to-last line, string.Join(",", g.Select(columns=>columns[0]))); - invalid argument. Any help? – Malik Feb 12 '13 at 11:34
  • @Malik Are you using .net 3.5? My code requires .net 4.0. In that case use the code @Aron linked to read the file, and add a `.ToArray()` call on the second parameter of the join. – CodesInChaos Feb 12 '13 at 11:36
  • @CodesInChaos - Also, I do have a text file with data the same as posted in my question. The number of lines is 16522439. – Malik Feb 12 '13 at 11:37
  • @CodesInChaos - That was quick. I can switch to .Net 4.0. Let me try it on .Net 4.0 and will get back soon. Thanks again for such an amazing reply time. – Malik Feb 12 '13 at 12:05
  • @CodesInChaos - Thanks a lot. Worked like a charm. Thanks again. – Malik Feb 13 '13 at 06:33
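
Following up on the comment above about converting the strings to int before the GroupBy, such a variant might look like this (a sketch only, not part of the original answer; it assumes every line contains exactly two parseable integers):

// Variant that parses both columns to int before grouping, so the groups
// hold ints instead of string arrays (sketch; assumes clean input).
var output = File.ReadLines(inputFilename)
    .Select(line => line.Split(','))
    .Select(columns => new { Col1 = int.Parse(columns[0]), Col2 = int.Parse(columns[1]) })
    .GroupBy(row => row.Col2)
    .Select(g => g.Key + " " + string.Join(",", g.Select(row => row.Col1)));

File.WriteAllLines(outputFilename, output);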

Just use ORDER BY to process one col2 at a time:

select col1, col2 
from table  
order by col2, col1

Then just write out a line whenever you get a new value for col2:

int col2Last = 0;  // assume 0 is not a valid value
StringBuilder sb = new StringBuilder();

while (rdr.Read())
{
    int col1 = rdr.GetInt32(0);
    int col2 = rdr.GetInt32(1);
    if (col2 != col2Last && col2Last != 0)
    {
        // col2 changed: write out the accumulated col1 values for the previous key
        Console.WriteLine(col2Last.ToString() + " " + sb.ToString());
        sb.Clear();
    }
    if (sb.Length > 0) sb.Append(",");
    sb.Append(col1.ToString());
    col2Last = col2;
}
// write out the final group
Console.WriteLine(col2Last.ToString() + " " + sb.ToString());
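
The reader rdr above is assumed to come from executing the ORDER BY query shown earlier; one way it might be obtained (the connection string and table name are placeholders, not part of the answer):

// Sketch of how rdr could be obtained; requires using System.Data.SqlClient;
// connectionString and "YourTable" are placeholder names.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("select col1, col2 from YourTable order by col2, col1", conn))
{
    conn.Open();
    using (SqlDataReader rdr = cmd.ExecuteReader())
    {
        // ... the while (rdr.Read()) loop above goes here ...
    }
}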
paparazzo

If you have already imported the data into a table, why not try this:

declare @t table(c1 int, c2 int)
insert into @t values
(12334,23433),
(23234,45663),
(234422,324545),
(324543,23433),
(143233,23433)

select c2,
       replace((select cast(c1 as varchar) as 'data()'
                from @t
                where c2 = t.c2
                for xml path('')), ' ', ', ')
from @t t
group by c2
msi77
  • Thanks a lot for your reply, but I have a text file with 16522439 lines. I can't run inserts. I created the table using the SQL import/export wizard. Any SQL query? – Malik Feb 12 '13 at 11:36
  • This is an example only. Use the table you created instead of @t in my select statement. – msi77 Feb 13 '13 at 06:53
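
Following up on that comment, the same query run against the imported table might look like the sketch below (the table name dbo.ImportedData and the column names c1/c2 are placeholders for whatever the import wizard actually created):

-- Same FOR XML PATH aggregation, run against the imported table instead of @t
-- (dbo.ImportedData, c1 and c2 are placeholder names).
select c2,
       replace((select cast(c1 as varchar) as 'data()'
                from dbo.ImportedData
                where c2 = t.c2
                for xml path('')), ' ', ', ')
from dbo.ImportedData t
group by c2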