-1

Hi, I have a text file in the following format:

Sam
John
Peter
Sam 
Peter
John

I want to extract the unique records from the file using a regular expression, so that the result is:

Sam
John
Peter

Please help me out.

Prashant

3 Answers

6

Use set:

In [1]: name="""
   ...: Sam
   ...: John
   ...: Peter
   ...: Sam 
   ...: Peter
   ...: John"""

In [2]: print(name)

Sam
John
Peter
Sam 
Peter
John

In [3]: a=name.split()

In [4]: a
Out[4]: ['Sam', 'John', 'Peter', 'Sam', 'Peter', 'John']

In [5]: set(a)
Out[5]: {'John', 'Peter', 'Sam'}
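
A set does not keep the names in the order they first appear, while the output asked for in the question (Sam, John, Peter) does. A minimal sketch of an order-preserving variant, assuming Python 3.7+ (where dicts keep insertion order); the sample string is just the data from the question:

```python
# Sample data from the question, including the trailing space after
# the second "Sam".
name = "Sam\nJohn\nPeter\nSam \nPeter\nJohn"

# split() with no arguments splits on any run of whitespace, so the
# stray trailing space is discarded along with the newlines.
# dict.fromkeys() keeps only the first occurrence of each key, in order.
unique_in_order = list(dict.fromkeys(name.split()))
print(unique_in_order)  # ['Sam', 'John', 'Peter']
```

On older Python versions, `collections.OrderedDict.fromkeys` should give the same result.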
Fredrik Pihl
  • Thanks for the answer, but I want the same output using regular expressions in Python. – Prashant Sep 27 '13 at 11:07
  • 4
    @Prashant Do you know what regular expressions are? This isn't a problem that can be solved by them. – l4mpi Sep 27 '13 at 11:11
  • @l4mpi You mean "this isn't a problem that *should* be solved by them". They can actually solve this. – Veedrac Sep 27 '13 at 15:42
  • 1
    @Veedrac no, I actually meant "can't solve this". Finding duplicates is outside of the domain of regular languages, which is what I think of when I hear "regular expression". I know that there are backreferences etc., and that with these, REs are actually able to process far more than regular languages, but that's not a "real" regular expression in my view. And using RE for non-regular things is probably always hackish at best and stupid at worst... (I still upvoted your answer though, appreciate the humor ^^) – l4mpi Sep 27 '13 at 18:01
5

Don't listen to them!

Of course this can be done in Regex. Never mind that they have the correct, linear-time O(n) solution that's readable and concise, or that any Regex solution will be at least quadratic-time and about as readable as a drunkard's scrawling.

What matters is that it's Regex, and Regex must be good. Here you go:

import re
# target_string holds the file contents, one name per line
re.findall(r"""(?ms)^([^\n]*)$(?!.*^\1$)""", target_string)
#>>> ['Sam', 'Peter', 'John']
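
For anyone puzzling over the pattern, here is the same expression spelled out with `re.VERBOSE`. This is only a sketch: `target_string` is an assumed sample (the question's data without the stray trailing space), and the variable names are illustrative.

```python
import re

# Assumed sample input: the names from the question, one per line.
target_string = "Sam\nJohn\nPeter\nSam\nPeter\nJohn"

pattern = re.compile(
    r"""
    ^([^\n]*)$      # capture one whole line (MULTILINE anchors)
    (?!.*^\1$)      # negative lookahead: fail if the same line
                    # appears again anywhere later (DOTALL lets .
                    # cross newlines)
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

# Each name survives only at its LAST occurrence, so the result is in
# order of last appearance rather than first.
last_occurrences = pattern.findall(target_string)
print(last_occurrences)  # ['Sam', 'Peter', 'John']
```

The lookahead rescans the rest of the input for every line, which is where the quadratic cost comes from.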
Veedrac
4

It seems like you want to split the input into lines and then remove the duplicates using set(). You can convert the result back to a list using list(). It looks something like the code below; strip() removes the newline characters (and any other leading or trailing whitespace).

with open('names.txt') as f:  # the with block closes the file afterwards
    names = list(set(line.strip() for line in f))
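
Two details are easy to miss here: a set has no defined order, and blank lines in the file would end up as an empty string in the result. A self-contained sketch that handles both, assuming a temporary file as a stand-in for names.txt:

```python
import os
import tempfile

# Write the sample data to a temporary file (stand-in for names.txt).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Sam\nJohn\nPeter\nSam \nPeter\nJohn\n")
    path = tmp.name

with open(path) as f:
    # strip() drops the newline and the stray space in "Sam ";
    # the `if` filter skips blank lines; sorted() gives a stable order.
    names = sorted({line.strip() for line in f if line.strip()})

os.remove(path)
print(names)  # ['John', 'Peter', 'Sam']
```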
olly_uk