
My question is very simple, but I couldn't find a proper answer anywhere. I need a way to read a .txt file and, if a line is duplicated, remove ALL occurrences of it, not preserving one. For example, if a .txt contains the following:

1234
1233
1232
1234

The output should be:

1233
1232

Because the code has to delete the duplicated lines, all of them. I searched all over the web, but it always points to answers that remove duplicated lines but preserve one of them, like this, this or that.

I'm afraid the only way to do this is to read line x and check the whole .txt; if it finds an equal line, delete it and delete line x too, otherwise move on to the next line. But the .txt file I'm checking has 50 million lines (~900 MB), and I don't know how much memory I would need for this kind of task, so I'd appreciate some help here.

Edie Johnny
  • I would be interested if you can test my solution... Specially regarding its memory footprint :) – Denis Leger May 16 '16 at 19:26
  • Does the order of the lines in the output file matter? – Mike May 16 '16 at 19:34
  • Thank you for the effort; your code works with small files, but I can't test it with 50 million entries because I don't have enough memory for that... – Edie Johnny May 16 '16 at 19:35
  • @Mike, no, the order doesn't matter. – Edie Johnny May 16 '16 at 19:36
  • From a PHP perspective, as I use native functions, this should be quicker and more efficient, but regarding memory I think there is no secret here: if the file is too big, you can have memory issues. I think the challenge here was to have the quickest and most memory-efficient PHP code. So far (with pure PHP), I think my solution is probably the best (perhaps there are other possibilities). – Denis Leger May 16 '16 at 19:41
  • If order doesn't matter, just do `sort inputfile.txt | uniq -u > outputfile.txt` from the command line. No need for PHP. – Mike May 16 '16 at 19:47
  • if PHP is not part of the question, there are many options, yes. But the question was with PHP. – Denis Leger May 16 '16 at 19:50
  • @DenisLeger, the problem with your answer is that I have to put 50M lines into an array. I need at least 5 GB of RAM for that, while Barmar's answer doesn't even take 50 MB of RAM. – Edie Johnny May 16 '16 at 20:12
  • @EdieJohnny What about my comment above? Would that work for you? – Mike May 16 '16 at 20:18
  • @Mike, I run Windows, it doesn't work and I can't test it, sorry... – Edie Johnny May 16 '16 at 20:24

3 Answers


Read the file line by line, and use the line contents as the key of an associative array whose values are a count of the number of times the line appears. After you're done, write out all the lines whose value is only 1. This will require as much memory as all the unique lines.

$lines = array();
$fd = fopen("inputfile.txdt", "r");
while (($line = fgets($fd)) !== false) { // explicit check so a line containing "0" doesn't end the loop
    $line = rtrim($line, "\r\n"); // ignore the newline
    if (array_key_exists($line, $lines)) {
        $lines[$line]++;
    } else {
        $lines[$line] = 1;
    }
}
fclose($fd);
$fd = fopen("outputfile.txt", "w");
foreach ($lines as $line => $count) {
    if ($count == 1) {
        fputs($fd, "$line" . PHP_EOL); // add the newlines back
    }
}
fclose($fd);
Barmar
  • It should be noted that this way OP will need quite a lot of RAM if there aren't many duplicates, since he has 900 MB of data. Also, if lines are on average long enough, it may be better to use the line's hash instead of the content itself as the array key (a sketch of this variant follows at the end of these comments). – Jakub Matczak May 16 '16 at 18:41
  • @dragoste True, but it's better than the answer that reads the entire file into memory before removing all the duplicates. Since his example has short lines, I decided not to bother showing a solution that's more appropriate for long lines. – Barmar May 16 '16 at 18:43
  • Yes indeed, it's better, and that's why I've chosen your answer to point out my thoughts. – Jakub Matczak May 16 '16 at 18:46
  • It gave me a weird error: `Notice: Undefined index: 1,18,50,52,27,56` – Edie Johnny May 16 '16 at 18:55
  • I used a wrong function, it should be `array_key_exists`, not `in_array`. – Barmar May 16 '16 at 19:00
  • Alright, the outputfile.txt had no data at the end of execution. Does that mean all of the lines have at least one duplicate? Well, I'll test it with some examples containing unique lines; if it works, I'll choose your answer because it was the only one that worked with minimal memory and CPU. Thank you a lot for your efforts. – Edie Johnny May 16 '16 at 19:13
  • You can verify it with the Unix command, but it may be very slow: `sort inputfile.txt | uniq -u | wc -l` This will show how many lines there should be in the output file. – Barmar May 16 '16 at 19:15
  • Well, it doesn't work. I wrote the following to test on `res.txt`: `1,2,3,4` `1,2,3,3` `1,2,3,2` `1,2,3,4` And the `outputfile.txt` got the same data, not only: `1,2,3,3` `1,2,3,2` **Note:** The spaces are newlines. [Here](http://pastebin.com/TxHKb6sf) is the pastebin of the code. – Edie Johnny May 16 '16 at 19:25
  • I just tried your code with that input file, and the output file is correct. – Barmar May 16 '16 at 19:37
  • I suspect your last line doesn't have a newline after it. I've changed my code so it trims whitespace from the lines when reading, and then adds newlines when writing. – Barmar May 16 '16 at 19:39
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/112081/discussion-between-edie-johnny-and-barmar). – Edie Johnny May 16 '16 at 19:43
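A minimal two-pass sketch of the hashing variant Jakub Matczak mentions above, assuming md5() of each line is used as the array key so the keys stay small for long lines (md5 collisions are theoretically possible; the file names mirror the answer above and are placeholders):

$counts = array();
$fd = fopen("inputfile.txt", "r");
while (($line = fgets($fd)) !== false) {
    $key = md5(rtrim($line, "\r\n")); // fixed-size key regardless of line length
    $counts[$key] = isset($counts[$key]) ? $counts[$key] + 1 : 1;
}
fclose($fd);

// Second pass: the keys are hashes, so the original lines must be re-read from the file.
$in  = fopen("inputfile.txt", "r");
$out = fopen("outputfile.txt", "w");
while (($line = fgets($in)) !== false) {
    $line = rtrim($line, "\r\n");
    if ($counts[md5($line)] == 1) {
        fputs($out, $line . PHP_EOL);
    }
}
fclose($in);
fclose($out);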

I doubt there is one and only one function that does all of what you want to do. So, this breaks it down into steps...

First, can we load a file directly into an array? See the documentation for the file() function.

$lines = file('mytextfile.txt');

Now I have all of the lines in an array, and I want to count how many of each entry I have. See the documentation for the array_count_values() function.

$counts = array_count_values($lines);

Now, I can easily loop through the array and delete any entries where the count>1

foreach($counts as $value=>$cnt)
  if($cnt>1)
    unset($counts[$value]);

Now, I can turn the array keys (which are the original line values) into an array.

$nondupes = array_keys($counts);

Finally, I can write the contents out to a file.

file_put_contents('myoutputfile.txt', $nondupes);
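For reference, here are those steps combined into one runnable sketch. The FILE_IGNORE_NEW_LINES flag and the implode() on output are additions here (not part of the original steps), so that a missing trailing newline on the last line doesn't make it look different from its duplicates:

// Load lines without their trailing newlines.
$lines = file('mytextfile.txt', FILE_IGNORE_NEW_LINES);

// Count how many times each distinct line occurs.
$counts = array_count_values($lines);

// Drop every line that occurs more than once.
foreach ($counts as $value => $cnt) {
    if ($cnt > 1) {
        unset($counts[$value]);
    }
}

// The remaining keys are the unique lines; write them out, one per line.
file_put_contents('myoutputfile.txt', implode(PHP_EOL, array_keys($counts)));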
kainaw
  • Are you planning to execute this code in the CLI or in a browser? I would recommend executing it in the CLI so as not to hit the PHP memory and time limits. – Ambroise Maupate May 16 '16 at 18:55
  • @AmbroiseMaupate If he is parsing 50 million lines through the web, he will hit either a memory or time limit with a default install. That is his fault. – kainaw May 16 '16 at 18:58
  • What if he adds all this data to an SQL database? Then he could create a simple SQL query to find duplicated entries. With SQLite it won't require any additional setup (if he has set up the php-sqlite extension); see the sketch after these comments. – Ambroise Maupate May 16 '16 at 19:18
  • @AmbroiseMaupate ...or he could use command-line utilities: `cat file | sort | uniq -u` – kainaw May 16 '16 at 19:22
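A rough sketch of the SQLite route Ambroise Maupate suggests, assuming the PDO SQLite driver is available; the table name, database file, and input/output file names are placeholders:

$db = new PDO('sqlite:dedupe.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS lines (content TEXT)');

// Insert all lines inside one transaction so the bulk load stays fast.
$db->beginTransaction();
$insert = $db->prepare('INSERT INTO lines (content) VALUES (?)');
$fd = fopen('inputfile.txt', 'r');
while (($line = fgets($fd)) !== false) {
    $insert->execute(array(rtrim($line, "\r\n")));
}
fclose($fd);
$db->commit();

// Let SQLite do the counting: keep only lines that occur exactly once.
$out = fopen('outputfile.txt', 'w');
foreach ($db->query('SELECT content FROM lines GROUP BY content HAVING COUNT(*) = 1') as $row) {
    fputs($out, $row['content'] . PHP_EOL);
}
fclose($out);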

I think I have a far more elegant solution:

$array = array('1', '1', '2', '2', '3', '4'); // array with some unique values, some not unique

$array_count_result = array_count_values($array); // count values occurences

$result = array_keys(array_filter($array_count_result, function ($value) { return ($value == 1); })); // filter and isolate only unique values

print_r($result);

gives:

Array
(
    [0] => 3
    [1] => 4
)
Denis Leger