0

I have this code:

 jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
 with open(ticket_file, 'r') as f:
     tickets = [word for line in f for word in line.split() if jira_regex.match(word) and word not in tickets]

ticket_file contains this:

PRJ1-2333
PRJ1-2333
PRJ1-2333
PRJ2-2333
PRJ2-2333
MISC-5002

After the code runs, the tickets list contains these:

['PRJ1-2333', 'PRJ1-2333', 'PRJ1-2333', 'PRJ2-2333', 'PRJ2-2333', 'MISC-5002']

I expected this:

['PRJ1-2333', 'PRJ2-2333', 'MISC-5002']

Why is the word not in tickets condition not eliminating duplicates? The regex filter, however, works fine.

Mehul Gupta
codeforester

4 Answers

2

You can use a set:

  • Sets can only contain unique values
    • I've used set(...) to be explicit, but set(...) can be replaced with {...}.
    • This implementation builds a generator inside set()
    • Don't use a list-comprehension inside (e.g. set([...])), because the list can potentially use a lot of memory.
  • word not in tickets causes NameError: name 'tickets' is not defined because, from the perspective of the list comprehension, tickets does not exist.
    • If you're not getting a NameError, it's because tickets exists in memory already, or tickets is assigned in your code, but not this example.
    • Given the example code, if you clear the environment, and run the code, you'll get an error.
  • .match returns something like <re.Match object; span=(0, 9), match='PRJ1-2333'> or None
    • Where match = jira_regex.match(t), if there's a match, get the value with match[0].
    • word for line in f for word in line.split() if jira_regex.match(word) assumes that if jira_regex.match(word) isn't None that the match is always equal to word. Based on the sample data, this is the case, but I don't know if that's the case with the real data.
import re

jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
with open('test.txt', 'r') as f:
    tickets = set(word for line in f for word in line.split() if jira_regex.match(word))
    
print(tickets)

{'MISC-5002', 'PRJ1-2333', 'PRJ2-2333'}
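
The caveat above about match[0] not always being equal to word can be seen with a made-up token. This is a minimal sketch: the trailing comma is an assumption for illustration and does not come from the sample data. Because the pattern has no end anchor, .match accepts any token that merely starts with a ticket ID.

```python
import re

jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")

word = "PRJ1-2333,"  # hypothetical token with a trailing comma
match = jira_regex.match(word)

print(match[0])                    # PRJ1-2333
print(match[0] == word)            # False
print(jira_regex.fullmatch(word))  # None
```

If only exact ticket IDs should pass the filter, .fullmatch is one option; if the ID should be extracted from a longer token, keep match[0] rather than word.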

Without .split():

  • It seems as if line.split() is being used to get rid of the newline, which can be accomplished with line.strip()

Option 1:

jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
with open('test.txt', 'r') as f:
    tickets = set(jira_regex.match(word.strip())[0] for word in f)  # assumes .match will never be None
    
print(tickets)
{'MISC-5002', 'PRJ1-2333', 'PRJ2-2333'}

Option 2:

jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
with open('test.txt', 'r') as f:
    tickets = set(word.strip() for word in f if jira_regex.match(word.strip()))
    
print(tickets)
{'MISC-5002', 'PRJ1-2333', 'PRJ2-2333'}

For the code to be explicit:

jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
tickets = []
with open('test.txt', 'r') as f:
    for t in f:
        t = t.strip()  # strip leading and trailing whitespace, including the newline
        match = jira_regex.match(t)  # assign .match to a variable
        if match is not None:  # check if a match was found
            match = match[0]  # extract the match value, depending on the data, this may not be the same as 't'
            if match not in tickets:  # check if match is in tickets
                tickets.append(match)  # if match is not in tickets, add it to tickets

print(tickets)
['PRJ1-2333', 'PRJ2-2333', 'MISC-5002']
Trenton McKinney
0

Why is word not in tickets condition not eliminating duplicates?

It is because the name tickets is not bound until the list comprehension has finished, so the condition inside the comprehension cannot see the list being built.
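
A minimal sketch of this, assuming the name has never been assigned before. fresh_tickets and words are hypothetical names used only for illustration, and the hard-coded list stands in for the file contents.

```python
words = ["PRJ1-2333", "PRJ1-2333", "PRJ2-2333"]  # stand-in for the file contents

error = None
try:
    # fresh_tickets has never been assigned, so looking it up inside the
    # comprehension fails before any list can be bound to the name.
    fresh_tickets = [w for w in words if w not in fresh_tickets]
except NameError as exc:
    error = exc

print(type(error).__name__)
```

Run as a top-level script, this prints NameError. If the name was assigned earlier, no error is raised, but the check still runs against the old object rather than the list under construction.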

You can do a set comprehension like this (not tested):

tickets = {word for line in f for word in line.split() if jira_regex.match(word)}
Ronie Martinez
  • How is using a set comprehension, different from my solution? – Trenton McKinney Jun 30 '20 at 04:33
  • @TrentonMcKinney you are building the list first from the list comprehension and then passing it to the `set()` function. There is a big overhead if the size of the items is huge. – Ronie Martinez Jun 30 '20 at 04:34
  • 1
    @TrentonMcKinney reviewed your code again. I think they are not different since you are just passing the generator instead. I could be wrong. A benchmark can be an answer. – Ronie Martinez Jun 30 '20 at 04:36
  • 1
    I used `set()` to be explicit, but as you just noted, I'm not building a list comprehension first. – Trenton McKinney Jun 30 '20 at 04:38
  • Yeah, it is just a difference in coding style I guess. I most of the time use, list, set, and dict comprehensions and generator expressions. – Ronie Martinez Jun 30 '20 at 04:40
0

I'm assuming you predefined tickets in your code.
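The if condition does not work because the comprehension builds a new list, and the name tickets is only rebound to that list once the comprehension has finished. Until then, the tickets in your if clause still refers to the predefined list, which stays empty, so word not in tickets is always true.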
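
A small sketch of that behaviour, assuming tickets was predefined as an empty list. The words list is a hypothetical stand-in for the file contents.

```python
tickets = []  # assumption: tickets was predefined as an empty list
words = ["PRJ1-2333", "PRJ1-2333", "PRJ2-2333"]  # stand-in for the file contents

# The right-hand side is evaluated in full before the name is rebound, so
# every `w not in tickets` check runs against the original empty list.
tickets = [w for w in words if w not in tickets]

print(tickets)  # ['PRJ1-2333', 'PRJ1-2333', 'PRJ2-2333']
```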

I believe this is what you are trying to do:

 jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
 with open(ticket_file, 'r') as f:
     [tickets.append(word) for line in f for word in line.split() if jira_regex.match(word) and word not in tickets]
Red
  • If _I'm assuming you predefined tickets as an empty list in your code_ is the case, then this will work. However, it's considered anti-pythonic to implement this type of [`side-effect`](https://stackoverflow.com/questions/5753597/is-it-pythonic-to-use-list-comprehensions-for-just-side-effects). – Trenton McKinney Jun 30 '20 at 05:15
-1

Why is word not in tickets condition not eliminating duplicates? It does not work because of how this statement is evaluated:

 tickets = [word for line in f for word in line.split() if jira_regex.match(word) and word not in tickets]

This is a list comprehension, so the result is assigned to the variable tickets only after the whole file has been read. In short, the condition word not in tickets contributes nothing, because tickets is not assigned until every line has been processed. What you can do instead is:

 jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
 with open(ticket_file, 'r') as f:
     tickets = [word for line in f for word in line.split() if jira_regex.match(word)]
     tickets=set(tickets)

This will remove all your duplicate values. Note that the result is a set, so the original order of the tickets is not preserved.
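
If the result should be a list in first-seen order, as in the expected output from the question, one possible sketch uses dict.fromkeys, which keeps insertion order in Python 3.7 and later. The io.StringIO object here is an assumption standing in for the opened ticket file, so the example runs on its own.

```python
import io
import re

jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")

# io.StringIO stands in for the opened ticket file (same sample data as the question)
sample = io.StringIO(
    "PRJ1-2333\nPRJ1-2333\nPRJ1-2333\nPRJ2-2333\nPRJ2-2333\nMISC-5002\n"
)

# Generator of matching words; nothing is materialised until dict.fromkeys consumes it
matches = (word for line in sample for word in line.split() if jira_regex.match(word))

# Dict keys are unique and keep insertion order, so this dedupes in first-seen order
tickets = list(dict.fromkeys(matches))

print(tickets)  # ['PRJ1-2333', 'PRJ2-2333', 'MISC-5002']
```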

Mehul Gupta
  • This is essentially a copy of the first part of my solution, except with worse performance because you're building the entire list first and then using `set()` on the entire list. – Trenton McKinney Jun 30 '20 at 04:40