0

I have the following code

import itertools
import re

listnew = ['E-Textbooks', 'Dynamic', 'Case', 'Management', '(', 'DCM', ')']
nounbreak = list(itertools.chain(*[re.findall(r"\b\w+\b(?![\(\w+\)])", i) for i in listnew]))

While the above code successfully removes '-' and even '/', it is somehow not able to ignore the words in the brackets.

The ideal output required is

['E', 'Textbooks','Dynamic', 'Case', 'Management']

How do I tweak the above regex itself to produce the desired output?

venkatttaknev
  • What `brackets` does this refer to ? –  Dec 04 '19 at 19:10
  • This class `[\(\w+\)]` is better written as `[+()\w]` –  Dec 04 '19 at 19:12
  • brackets as in parentheses. this -> '(', 'DCM', ')' – venkatttaknev Dec 04 '19 at 19:13
  • This `'(', 'DCM', ')'` is a string with commas and quotes, I don't understand. –  Dec 04 '19 at 19:14
  • yes I basically want this line of code `nounbreak = list(itertools.chain(*[re.findall(r"\b\w+\b(?![\(\w+\)])", i) for i in listnew]))` to ignore ` ''(', 'DCM', ')' ` – venkatttaknev Dec 04 '19 at 19:16
  • So the real question is how to filter a list to remove `'(', 'DCM', ')'` then? I feel like this list was probably created from a `re.findall` call that might be misguided. I'm getting a bit of a [x-y problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) feeling... If this is your only list, why not just use `listnew[:-3]` and be done with it? What is the original input and requirements for filtering exactly? – ggorlen Dec 04 '19 at 19:17
  • Does `for i in listnew` loop through the individual values in an array ? Does it combine them into a single string before running the regex on it ? If so, `\b\w+\b` will have a hard time matching much. –  Dec 04 '19 at 19:20
  • @ggorlen I do realize that there are other ways to filter a list to remove `'(', 'DCM', ')'` but the thing is I want to retain the code `list(itertools.chain(*[re.findall(r"\b\w+\b(?![\(\w+\)])", i) for i in listnew]))` and make tweaks to this code alone for sake of brevity or simplicity and achieve the desired result – venkatttaknev Dec 04 '19 at 19:21
  • @x15 unfortunately, `for i in listnew` does loop through individual values in the list. – venkatttaknev Dec 04 '19 at 19:23
  • You'd be better off Joining on a non-word, like comma to make a single string. Then running findall using a regex to create a new list. I can give you a regex for that if you need to do it that way. –  Dec 04 '19 at 19:26
  • @x15 sure, let me try that. Kindly give me that regex – venkatttaknev Dec 04 '19 at 19:33
  • I'll post it as an answer. –  Dec 04 '19 at 19:34
  • @ggorlen My input is user driven. Though it may not exceed 5 words at any point. The position of words having () can occur at any position 0, 1, 2, ,3, 4 i.e. the index I am referring to. Now listnew[:-3] will work only if always my words in parentheses occur at that position. Hence listnew[:-3] is not a good solution. – venkatttaknev Dec 04 '19 at 19:48
  • Clearly, but without context there's no way to know that--slicing off the last three elements works great on the only data you provided, so you need to explain what you're _generally_ trying to achieve. Likely, `listnew` was created with a prior regex split or findall of some sort on the input you mentioned above, but it's awkward to work with and is motivating ugly solutions that try to shoehorn it into working. Why not provide the raw user input string and the result structure you want and let someone provide a better way to solve the problem X than the Y code that you've provided? – ggorlen Dec 04 '19 at 19:53

2 Answers

1

Your problem is that your regex looks at each list element separately - it cannot "see" that there are "(" and ")" elements before/after the element it is currently looking at.
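A quick way to see this is to run the pattern on each element and keep the results separate. This is only a small sketch using the list and regex from the question; the names `pattern` and `per_element` are made up for illustration:

```python
import re

listnew = ['E-Textbooks', 'Dynamic', 'Case', 'Management', '(', 'DCM', ')']
pattern = r"\b\w+\b(?![\(\w+\)])"  # regex from the question, unchanged

# each element is matched in isolation, so 'DCM' has no "(" or ")" next to it
per_element = [re.findall(pattern, item) for item in listnew]
print(per_element)
# [['E', 'Textbooks'], ['Dynamic'], ['Case'], ['Management'], [], ['DCM'], []]
```

The "(" and ")" elements contain no word characters and yield nothing, while 'DCM' on its own is an ordinary word, so the lookahead never gets a chance to reject it.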

I propose cleaning your list beforehand:

import re
from itertools import chain

listnew = ['E-Textbooks','Dynamic', 'Case', 'Management', '(', 'DCM', ')'] 

# collect indexes of "(" and ")" elements and of everything between them
# does not work for nested parentheses like ((())) - you might need to do
# something more elaborate if that can happen
remove = []
for i, k in enumerate(listnew):
    if k == "(":
        remove.append(i)
    elif k == ")":
        remove.append(i)
    elif remove and i == remove[-1] + 1 and listnew[remove[-1]] != ")":
        # the previous element was removed and was not the closing ")",
        # so we are still inside the parentheses
        remove.append(i)

remove = frozenset(remove)
data = [k for i, k in enumerate(listnew) if i not in remove]


# did not touch your regex per se - you might want to simplify it using regex101.com
nounbreak =  list(chain(*[re.findall(r"\b\w+\b(?![\(\w+\)])", i) for i in data]))

print(nounbreak)

Output:

['E', 'Textbooks', 'Dynamic', 'Case', 'Management']

If you only have short lists, you could also ' '.join(..) them and strip everything inside parentheses from the resulting string; see e.g. "Regular expression to return text between parenthesis" on how to match that text so you can remove it.

Patrick Artner
  • thanks for your answer. yes, Indeed in my use case I will have short lists alone. Say not more than 5 words in a list. Could u please provide an solution with the ' '.join(..) also . I reckon that will be lesser lines of codes and perhaps more efficient. – venkatttaknev Dec 04 '19 at 19:39
  • @venk `','.join(listnew)` and @x15's regex should work for that combined with `result = [i for i in re.findall( r"\(.*?\)|\b(\w+)\b", str ) if i]` to eliminate "empty" matches – Patrick Artner Dec 04 '19 at 19:45
1

This is a sparse solution that just demonstrates the regex.
It joins the list on a non-word character, a comma in this case, then
runs the regex on the result using findall.
The parenthesized part matches without capturing anything, so it shows up
as an empty string that can be filtered out via a list comprehension.

The regex :

   \( .*? \) 
|  \b
   ( \w+ )                       # (1)
   \b

Python code :

>>> import re
>>> list_orig = ['E-Textbooks','Dynamic', 'Case', 'Management', '(', 'DCM', ')']
>>> str = ','.join( list_orig )
>>> list_new = re.findall( r"\(.*?\)|\b(\w+)\b", str )
>>> list_new = [i for i in list_new if i]
>>> print( list_new )
['E', 'Textbooks', 'Dynamic', 'Case', 'Management']
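Since the parenthesized words may sit anywhere in the list, the same idea can be wrapped in a small function and tried on another position. This is only a sketch: `words_outside_parens` and `PATTERN` are hypothetical names, and it assumes the parentheses are balanced and not nested.

```python
import re

# same regex as above; only the second alternative has a capture group
PATTERN = re.compile(r"\(.*?\)|\b(\w+)\b")

def words_outside_parens(items):
    """Join the list on commas and return the words not inside ( ... )."""
    joined = ','.join(items)
    # a "( ... )" match captures nothing, so findall yields '' for it
    return [w for w in PATTERN.findall(joined) if w]

sample = ['Dynamic', '(', 'DCM', ')', 'Case']
raw = PATTERN.findall(','.join(sample))
print(raw)                           # ['Dynamic', '', 'Case']
print(words_outside_parens(sample))  # ['Dynamic', 'Case']
```

The raw findall result shows where the empty string comes from: the whole `(,DCM,)` span is consumed by the first alternative, which has no group to report.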