
I have a list of strings and I want to use regex to filter the list to certain strings.

Ex. Here is the original list:

quoteTitle = ['\r\n      ', ' ', '\r\n    ', '\r\n    ', '\r\n    ', '\r\n    ', '\r\n  ', '30. Loyalty', '29. Speed Scale', '28. Security', '27. Every Position', '26. Superior Brain Power', '25. A Long Line of Fighters', '24. Dwight Surveillance', '23. Friends ', '22. Pull the Plug ', '21. Second Life', '20. Accidentally vs. On Purpose', '19. Menstruation Wishes ', '18. Ideal Choice', '17. Healthcare in the Wild', '16. Superior Cousins', '15. Regular Ideas', '14. Immunity Logic', '13. The Person You Least, Medium and Most Suspect', '12. Real Heroes ', '11. Water Cooler Gossip', '10. Stress', '9. All These People!', '8. The “R” Sound', '7. A Woman’s Defects ', '6.Werewolf Hunting Experience ', '5. An Ideal World ', '4. Attention', '3. The Thing About Bear Attacks ', '2. Resume Critiquing', '1. Yeast Infections ', 'Tags', 'Recently in TV', '5/8/2018 3:45:00 PM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', 'Most Popular', '5/3/2018 11:00:00 AM', '5/3/2018 12:00:00 PM', '5/8/2018 3:45:00 PM', '4/13/2018 5:00:00 PM', '4/18/2018 8:22:31 PM', '5/3/2018 2:00:00 PM', '5/8/2018 6:02:54 PM', '5/7/2018 4:52:04 PM', '5/7/2018 2:57:00 PM', '5/3/2018 5:04:43 PM', '5/3/2018 1:06:18 PM', 'Music', '5/3/2018 12:00:00 AM', 'Music', '5/3/2018 12:00:00 AM', 'Music', '5/2/2018 12:00:00 AM', '11/28/2017 8:00:00 AM', '11/30/2017 11:00:00 AM', '5/3/2018 11:00:00 AM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', '12/11/2017 1:00:00 PM', '12/15/2017 8:00:00 AM', '1/9/2018 3:00:00 PM', '1/4/2017 12:30:00 PM', '4/13/2013 12:13:00 PM', 'TV', '4/3/2018 10:00:00 AM', 'TV', '4/3/2018 9:25:00 AM', 'Comedy', '3/22/2018 1:00:28 PM', 'TV', '3/15/2018 10:00:00 AM', 'Comedy', '3/13/2018 2:00:00 PM', 'TV', '3/10/2018 10:00:00 AM', 'TV', '3/2/2018 11:00:00 AM', 'TV', '2/25/2018 10:30:00 PM', 'TV', '2/23/2018 1:00:00 PM', '5/3/2018 11:00:00 AM', '5/3/2018 12:00:00 PM', '5/8/2018 3:45:00 PM', '4/13/2018 5:00:00 PM', '4/18/2018 8:22:31 PM', '5/3/2018 2:00:00 PM', '5/3/2018 10:00:00 AM', '5/7/2018 
10:00:00 AM', '4/26/2018 2:00:00 PM', '5/6/2018 10:00:00 PM', '5/8/2018 6:02:54 PM', '5/7/2018 4:52:04 PM', '5/7/2018 2:57:00 PM', '5/3/2018 5:04:43 PM', '5/3/2018 1:06:18 PM', '5/3/2018 12:00:00 AM', '5/3/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '11/28/2017 8:00:00 AM', '11/30/2017 11:00:00 AM', '5/3/2018 11:00:00 AM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', '12/11/2017 1:00:00 PM', '12/15/2017 8:00:00 AM', '1/9/2018 3:00:00 PM', '1/4/2017 12:30:00 PM', '4/13/2013 12:13:00 PM', '4/3/2018 10:00:00 AM', '4/3/2018 9:25:00 AM', '3/22/2018 1:00:28 PM', '3/15/2018 10:00:00 AM', '3/13/2018 2:00:00 PM']

I want only the numbered items, 30 down to 1, together with the text that follows each number. I can successfully filter out anything that doesn't start with a number using

p = re.compile(r'\w')
q = filter(p.match, quoteTitle)
p = re.compile(r'^\d+')
q = filter(p.match, q)

This gets me to

print(list(q)) --> ['30. Loyalty', '29. Speed Scale', '28. Security', '27. Every Position', '26. Superior Brain Power', '25. A Long Line of Fighters', '24. Dwight Surveillance', '23. Friends ', '22. Pull the Plug ', '21. Second Life', '20. Accidentally vs. On Purpose', '19. Menstruation Wishes ', '18. Ideal Choice', '17. Healthcare in the Wild', '16. Superior Cousins', '15. Regular Ideas', '14. Immunity Logic', '13. The Person You Least, Medium and Most Suspect', '12. Real Heroes ', '11. Water Cooler Gossip', '10. Stress', '9. All These People!', '8. The “R” Sound', '7. A Woman’s Defects ', '6.Werewolf Hunting Experience ', '5. An Ideal World ', '4. Attention', '3. The Thing About Bear Attacks ', '2. Resume Critiquing', '1. Yeast Infections ', 'Tags', 'Recently in TV', '5/8/2018 3:45:00 PM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', 'Most Popular', '5/3/2018 11:00:00 AM', '5/3/2018 12:00:00 PM', '5/8/2018 3:45:00 PM', '4/13/2018 5:00:00 PM', '4/18/2018 8:22:31 PM', '5/3/2018 2:00:00 PM', '5/8/2018 6:02:54 PM', '5/7/2018 4:52:04 PM', '5/7/2018 2:57:00 PM', '5/3/2018 5:04:43 PM', '5/3/2018 1:06:18 PM', 'Music', '5/3/2018 12:00:00 AM', 'Music', '5/3/2018 12:00:00 AM', 'Music', '5/2/2018 12:00:00 AM', '11/28/2017 8:00:00 AM', '11/30/2017 11:00:00 AM', '5/3/2018 11:00:00 AM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', '12/11/2017 1:00:00 PM', '12/15/2017 8:00:00 AM', '1/9/2018 3:00:00 PM', '1/4/2017 12:30:00 PM', '4/13/2013 12:13:00 PM', 'TV', '4/3/2018 10:00:00 AM', 'TV', '4/3/2018 9:25:00 AM', 'Comedy', '3/22/2018 1:00:28 PM', 'TV', '3/15/2018 10:00:00 AM', 'Comedy', '3/13/2018 2:00:00 PM', 'TV', '3/10/2018 10:00:00 AM', 'TV', '3/2/2018 11:00:00 AM', 'TV', '2/25/2018 10:30:00 PM', 'TV', '2/23/2018 1:00:00 PM', '5/3/2018 11:00:00 AM', '5/3/2018 12:00:00 PM', '5/8/2018 3:45:00 PM', '4/13/2018 5:00:00 PM', '4/18/2018 8:22:31 PM', '5/3/2018 2:00:00 PM', '5/3/2018 10:00:00 AM', '5/7/2018 10:00:00 AM', '4/26/2018 2:00:00 PM', '5/6/2018 10:00:00 PM', '5/8/2018 
6:02:54 PM', '5/7/2018 4:52:04 PM', '5/7/2018 2:57:00 PM', '5/3/2018 5:04:43 PM', '5/3/2018 1:06:18 PM', '5/3/2018 12:00:00 AM', '5/3/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '11/28/2017 8:00:00 AM', '11/30/2017 11:00:00 AM', '5/3/2018 11:00:00 AM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', '12/11/2017 1:00:00 PM', '12/15/2017 8:00:00 AM', '1/9/2018 3:00:00 PM', '1/4/2017 12:30:00 PM', '4/13/2013 12:13:00 PM', '4/3/2018 10:00:00 AM', '4/3/2018 9:25:00 AM', '3/22/2018 1:00:28 PM', '3/15/2018 10:00:00 AM', '3/13/2018 2:00:00 PM']

Now I want to remove the dates from the list.

I've tried a lot of combinations of this, but I think I'm missing something or not understanding. My thinking is to get all strings in the list that do not follow the format of the date entries.

p = re.compile(r'[^'\d+/]')
q = filter(p.match, q)

They start with an apostrophe because each one is a quoted string, and I think that might be my problem. Other than that, the format goes:

apostrophe, number (between 1-12 so \d+), /

That should be enough to filter out the date entries, as long as I get it working correctly.

Update: I even tried this to search for elements of the list that have an AM or PM in them, and still no luck:

p = re.compile(r'[^(AM|PM)]')
q = filter(p.search, q)
    Can you please update your question with the answer you expect or wish to receive from your program? I have a solution I suspect might work, but it depends on what you are actually looking for. – Ming May 10 '18 at 02:47

1 Answer

You can search for strings that start with one or more digits followed by a period:

import re
quoteTitle = ['\r\n      ', ' ', '\r\n    ', '\r\n    ', '\r\n    ', '\r\n    ', '\r\n  ', '30. Loyalty', '29. Speed Scale', '28. Security', '27. Every Position', '26. Superior Brain Power', '25. A Long Line of Fighters', '24. Dwight Surveillance', '23. Friends ', '22. Pull the Plug ', '21. Second Life', '20. Accidentally vs. On Purpose', '19. Menstruation Wishes ', '18. Ideal Choice', '17. Healthcare in the Wild', '16. Superior Cousins', '15. Regular Ideas', '14. Immunity Logic', '13. The Person You Least, Medium and Most Suspect', '12. Real Heroes ', '11. Water Cooler Gossip', '10. Stress', '9. All These People!', '8. The “R” Sound', '7. A Woman’s Defects ', '6.Werewolf Hunting Experience ', '5. An Ideal World ', '4. Attention', '3. The Thing About Bear Attacks ', '2. Resume Critiquing', '1. Yeast Infections ', 'Tags', 'Recently in TV', '5/8/2018 3:45:00 PM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', 'Most Popular', '5/3/2018 11:00:00 AM', '5/3/2018 12:00:00 PM', '5/8/2018 3:45:00 PM', '4/13/2018 5:00:00 PM', '4/18/2018 8:22:31 PM', '5/3/2018 2:00:00 PM', '5/8/2018 6:02:54 PM', '5/7/2018 4:52:04 PM', '5/7/2018 2:57:00 PM', '5/3/2018 5:04:43 PM', '5/3/2018 1:06:18 PM', 'Music', '5/3/2018 12:00:00 AM', 'Music', '5/3/2018 12:00:00 AM', 'Music', '5/2/2018 12:00:00 AM', '11/28/2017 8:00:00 AM', '11/30/2017 11:00:00 AM', '5/3/2018 11:00:00 AM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', '12/11/2017 1:00:00 PM', '12/15/2017 8:00:00 AM', '1/9/2018 3:00:00 PM', '1/4/2017 12:30:00 PM', '4/13/2013 12:13:00 PM', 'TV', '4/3/2018 10:00:00 AM', 'TV', '4/3/2018 9:25:00 AM', 'Comedy', '3/22/2018 1:00:28 PM', 'TV', '3/15/2018 10:00:00 AM', 'Comedy', '3/13/2018 2:00:00 PM', 'TV', '3/10/2018 10:00:00 AM', 'TV', '3/2/2018 11:00:00 AM', 'TV', '2/25/2018 10:30:00 PM', 'TV', '2/23/2018 1:00:00 PM', '5/3/2018 11:00:00 AM', '5/3/2018 12:00:00 PM', '5/8/2018 3:45:00 PM', '4/13/2018 5:00:00 PM', '4/18/2018 8:22:31 PM', '5/3/2018 2:00:00 PM', '5/3/2018 10:00:00 AM', '5/7/2018 
10:00:00 AM', '4/26/2018 2:00:00 PM', '5/6/2018 10:00:00 PM', '5/8/2018 6:02:54 PM', '5/7/2018 4:52:04 PM', '5/7/2018 2:57:00 PM', '5/3/2018 5:04:43 PM', '5/3/2018 1:06:18 PM', '5/3/2018 12:00:00 AM', '5/3/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '11/28/2017 8:00:00 AM', '11/30/2017 11:00:00 AM', '5/3/2018 11:00:00 AM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', '12/11/2017 1:00:00 PM', '12/15/2017 8:00:00 AM', '1/9/2018 3:00:00 PM', '1/4/2017 12:30:00 PM', '4/13/2013 12:13:00 PM', '4/3/2018 10:00:00 AM', '4/3/2018 9:25:00 AM', '3/22/2018 1:00:28 PM', '3/15/2018 10:00:00 AM', '3/13/2018 2:00:00 PM']
new_result = list(filter(lambda x: re.findall(r'^\d+\.', x), quoteTitle))

Output:

['30. Loyalty', '29. Speed Scale', '28. Security', '27. Every Position', '26. Superior Brain Power', '25. A Long Line of Fighters', '24. Dwight Surveillance', '23. Friends ', '22. Pull the Plug ', '21. Second Life', '20. Accidentally vs. On Purpose', '19. Menstruation Wishes ', '18. Ideal Choice', '17. Healthcare in the Wild', '16. Superior Cousins', '15. Regular Ideas', '14. Immunity Logic', '13. The Person You Least, Medium and Most Suspect', '12. Real Heroes ', '11. Water Cooler Gossip', '10. Stress', '9. All These People!', '8. The \xe2\x80\x9cR\xe2\x80\x9d Sound', '7. A Woman\xe2\x80\x99s Defects ', '6.Werewolf Hunting Experience ', '5. An Ideal World ', '4. Attention', '3. The Thing About Bear Attacks ', '2. Resume Critiquing', '1. Yeast Infections ']
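As a rough sketch of the same idea on a smaller input, the version below uses a precompiled pattern and a list comprehension. It also shows the opposite approach from the question, dropping entries that look like dates. The `sample` list and the `date_like` pattern are illustrative assumptions, not part of the original code; the date pattern only assumes the `month/day/year` shape seen in the list above.

```python
import re

# Hypothetical shortened sample standing in for quoteTitle.
sample = ['\r\n  ', '30. Loyalty', '6.Werewolf Hunting Experience ', 'Tags',
          '5/8/2018 3:45:00 PM', 'Music', '1. Yeast Infections ']

# Keep what you want: digits at the start, then a literal period.
numbered = re.compile(r'^\d+\.')
keep = [s for s in sample if numbered.match(s)]

# Or exclude what you don't want: strings that begin like 5/8/2018.
# (Assumed date shape; adjust if the real data differs.)
date_like = re.compile(r'^\d{1,2}/\d{1,2}/\d{4}')
no_dates = [s for s in sample if not date_like.match(s)]

print(keep)
print(no_dates)
```

Note that the exclusion approach alone still leaves the whitespace entries and labels such as 'Tags', which is why matching the wanted format directly is simpler here.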

Edit: to find all data between the quotes, you can use .*?:

quote = ['i dont want this', '\r\n ', '\r\n ', ' "this is the quote i want to extract" ', '" and also this one"', '\r\n "and me"']
new_results = list(map(lambda x:x[0], filter(None, [re.findall('"(.*?)"', i) for i in quote])))

Output:

['this is the quote i want to extract', ' and also this one', 'and me']
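The line above keeps only the first quoted piece from each string. If a string could contain several quoted pieces, a flattened comprehension is one possible variant. This is a sketch under that assumption; the last entry of `quote` below is a hypothetical addition to show the difference, and `quoted` and `all_quotes` are illustrative names.

```python
import re

# The list from the follow-up comment, plus one hypothetical entry
# containing two quoted pieces.
quote = ['i dont want this', '\r\n ', '\r\n ',
         ' "this is the quote i want to extract" ',
         '" and also this one"', '\r\n "and me"',
         '"first" and "second"']

# Non-greedy group: capture the shortest text between a pair of double quotes.
quoted = re.compile(r'"(.*?)"')

# findall returns a list of captured groups per string; flatten them all.
all_quotes = [m for s in quote for m in quoted.findall(s)]

print(all_quotes)
```

Strings with no quotes contribute nothing, so no separate `filter(None, ...)` step is needed.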
Ajax1234
  • Thank you! I didn't think to search for what i wanted instead of excluding what i didn't – Tyler Estes May 10 '18 at 03:48
  • follow up question if you don't mind. I have another list of strings like so: quote = ['i dont want this', '\r\n ', '\r\n ', ' "this is the quote i want to extract" ', '" and also this one"', '\r\n "and me"'] any suggestion how to get only the quotes inside the list of strings? tried following https://stackoverflow.com/questions/171480/regex-grabbing-values-between-quotation-marks with no luck – Tyler Estes May 10 '18 at 04:05