2

I have a string in Python and would like to remove duplicate lines (i.e. when the text between \n characters is the same, remove the second, third, fourth, ... occurrence) while preserving the order of the remaining lines. For example:

line1 \n line2 \n line3 \n line2 \n line2 \n line 4

would return:

line1 \n line2 \n line3 \n line 4

Other examples I have seen on Stack Overflow deal with this at the stage of reading the text file into Python (e.g. using readline(), checking whether the line is already in a set of lines read so far, and adding it to the string only if it is unique). In my case this doesn't work, because the string has already been heavily manipulated since it was loaded into Python, and it seems very clumsy to, e.g., write the whole string to a text file and then read it back in line by line looking for duplicated lines.

kyrenia

2 Answers

12

For Python 2.7+, this can be done with a one-liner:

from collections import OrderedDict

test_string = "line1 \n line2 \n line3 \n line2 \n line2 \n line 4"

"\n".join(OrderedDict.fromkeys(test_string.split("\n")))

This gives me: 'line1 \n line2 \n line3 \n line 4'
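On Python 3.7 and later, a plain dict also preserves insertion order, so the same idea should work without the import. The following is a minimal sketch under that assumption; the helper name `dedupe_lines` is hypothetical and not part of the answer above.

```python
def dedupe_lines(text):
    # dict keys are unique and, on Python 3.7+, keep insertion order,
    # so only the first occurrence of each line survives.
    return "\n".join(dict.fromkeys(text.split("\n")))


test_string = "line1 \n line2 \n line3 \n line2 \n line2 \n line 4"
print(repr(dedupe_lines(test_string)))
```

Lines are compared exactly, whitespace included, so " line2 " and "line2" would count as different lines.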

nullstellensatz
2
>>> lines = "line1 \n line2 \n line3 \n line2 \n line2 \n line 4"
>>> seen = set()
>>> answer = []
>>> for line in lines.splitlines():
...     if line not in seen:
...         seen.add(line)
...         answer.append(line)
... 
>>> print('\n'.join(answer))
line1 
 line2 
 line3 
 line 4
>>> '\n'.join(answer)
'line1 \n line2 \n line3 \n line 4'
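
The loop above can be wrapped in a reusable function. This is only a sketch, and the name `unique_lines` is hypothetical. One detail worth knowing: `str.splitlines()` also splits on `\r\n` and does not produce an empty final element for a trailing newline, so the result is always joined with plain `\n` and has no trailing newline.

```python
def unique_lines(text):
    seen = set()   # lines already emitted, for O(1) membership tests
    kept = []      # first occurrences, in original order
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


# Trailing newline and Windows line endings are normalized away here.
print(repr(unique_lines("a\nb\na\n")))
print(repr(unique_lines("a\r\nb\r\na")))
```

If the trailing newline matters, it would need to be added back after the join.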
inspectorG4dget