
I am not a programmer, but I would like some help to remove duplicate lines in a document and keep only the original lines. I was trying to do this with some text editors, such as EditPad Pro, but since my file is more than 1 gigabyte, they always freeze and can't complete the operation.

I know Perl is very good at this, but I don't know how to use it, keeping in mind that the file can be over 1 or 2 GB.

example of input lines:

line 1 
line 2
line 3
line 1
line 2
line 4
line 1

example of output lines:

line 1 
line 2
line 3
line 4

I am sorry if this is very basic, but I really don't know how to proceed; most of the time I just use built-in functions. I hope not to annoy anyone with this question.

alex
  • http://stackoverflow.com/questions/12841024/using-windows-dos-shell-batch-commands-how-do-i-take-a-file-and-only-keep-uniqu might help, but I am not sure it can handle huge files. – Joop Eggen Apr 13 '14 at 18:05

3 Answers


If you don't mind the lines not being in the original order, you can use this command:

$ sort -u old_file.txt > new_file.txt

sort sorts your file, and the -u option stands for "unique": of each group of identical lines, only one is kept in the output.

Even with very large files, sort may be your best hope. Implementations such as GNU sort do an external merge sort, spilling intermediate runs to temporary files on disk, so they are not limited by the amount of RAM available.
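
If speed or temporary disk space becomes an issue, two common tweaks may help (a sketch, assuming GNU sort; /big/tmp is a placeholder for any directory with enough free space):

$ LC_ALL=C sort -u -T /big/tmp old_file.txt > new_file.txt

LC_ALL=C makes sort compare raw bytes instead of locale collation order, which is usually much faster, and -T tells it where to put its temporary files.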

David W.
  • Ah OK, thanks, I ran it in a terminal on Linux and it works. At first I thought it didn't, but I opened the system monitor and saw it was working. Thanks a lot, my vote. – alex Apr 13 '14 at 19:12

Preserving the existing order (keeping the first occurrence of each line):

perl -i -wlne'our %uniq; $uniq{$_}++ or print' file.txt
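
A variant of the same one-liner that writes to a new file instead of editing in place:

perl -wlne '$uniq{$_}++ or print' file.txt > new_file.txt

Here -n loops over the input lines, -l handles the newlines, and $uniq{$_}++ is false only the first time a given line is seen, so each distinct line is printed exactly once. Note that the hash keeps one entry per distinct line in memory, so a 1-2 GB file with mostly unique lines can need a comparable amount of RAM; if that is a concern, the sort -u answer above is the safer choice.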
ysth

This can also be done effectively using awk (see http://awk.freeshell.org/AwkTips):

awk '!a[$0]++'
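
The idea is the same as in the Perl answer: a[$0]++ is zero (false) the first time a line is seen, and awk's default action for a true pattern is to print the current line. Used on a file it would look like this:

awk '!a[$0]++' old_file.txt > new_file.txt

Like the Perl version, it holds one array entry per distinct line in memory.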
Håkon Hægland