
I am not a programmer, but I would like some help to remove duplicate lines in a document and keep only the original lines. I was trying to do this with some text editors, such as EditPad Pro, but since my file is more than 1 gigabyte, they always freeze and can't complete the operation.

I know Perl is very good at this, but I don't know how to use it, keeping in mind that the file can be over 1 or 2 GB.

example of input lines:

line 1 
line 2
line 3
line 1
line 2
line 4
line 1

example of output lines:

line 1 
line 2
line 3
line 4

I am sorry if this is very basic, but I really don't know how to proceed; most of the time I just use built-in functions. I hope not to annoy anyone with this question.

alex
  • http://stackoverflow.com/questions/12841024/using-windows-dos-shell-batch-commands-how-do-i-take-a-file-and-only-keep-uniqu might help, but I am not sure it can handle huge files. – Joop Eggen Apr 13 '14 at 18:05

3 Answers


If you don't mind the lines not being in the original order, you can use this command:

$ sort -u old_file.txt > new_file.txt

sort sorts your file, and the -u option stands for "unique": of each group of identical lines, only one is kept in the output.

Even with very large files, sort may be your best hope. Implementations such as GNU sort do an external merge sort, spilling intermediate runs to temporary files on disk, so they are not limited by the amount of RAM available.
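
If speed or temporary disk space becomes an issue, two common tweaks may help (a sketch, assuming GNU sort; /big/tmp is a placeholder for any directory with enough free space):

$ LC_ALL=C sort -u -T /big/tmp old_file.txt > new_file.txt

LC_ALL=C makes sort compare raw bytes instead of locale collation order, which is usually much faster, and -T tells it where to put its temporary files.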

David W.
  • Ah OK, thanks, I ran it in a terminal on Linux and it works. At first I thought it didn't, but I opened the system monitor and saw it was working. Thanks a lot, my vote. – alex Apr 13 '14 at 19:12

Preserving the existing order (keeping the first occurrence of each line):

perl -i -wlne'our %uniq; $uniq{$_}++ or print' file.txt
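
A variant of the same one-liner that writes to a new file instead of editing in place:

perl -wlne '$uniq{$_}++ or print' file.txt > new_file.txt

Here -n loops over the input lines, -l handles the newlines, and $uniq{$_}++ is false only the first time a given line is seen, so each distinct line is printed exactly once. Note that the hash keeps one entry per distinct line in memory, so a 1-2 GB file with mostly unique lines can need a comparable amount of RAM; if that is a concern, the sort -u answer above is the safer choice.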
ysth

This can also be done effectively using awk (see http://awk.freeshell.org/AwkTips):

awk '!a[$0]++'
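
The idea is the same as in the Perl answer: a[$0]++ is zero (false) the first time a line is seen, and awk's default action for a true pattern is to print the current line. Used on a file it would look like this:

awk '!a[$0]++' old_file.txt > new_file.txt

Like the Perl version, it holds one array entry per distinct line in memory.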
Håkon Hægland