
I have a text file with multiple lines, punctuation, and other word boundaries.

 Example:
 TITLE: Praying  SINGER: Kesha

 [Music video spoken intro:]
 "Am I dead? Or is this one of those dreams? Those horrible dreams that seem 
 like they last forever? If I am alive, why? Why? If there is a God or 
 whatever, something, somewhere, why have I been abandoned by everyone and 
 everything I've ever known? 
 I've ever loved? Stranded. What is the lesson? 
 What is the point? 

 TITLE: Don't Stop the Party  SINGER: Pitbull

  I say, y'all having a good time, I'll bet

    Yeah, yeah, yeah
    Que no pare la fiesta
    Don't stop the party
    Yeah, yeah, yeah
    Que no pare la fiesta
    Don't stop the party

The goal is to pull the lyrics of each song without including the title or singer. There are at least 10 songs in my file, so the regex needs to extract all of them. This is what I have so far:

content = re.findall(r'TITLE\:\s\w+(.*)?\s*SINGER', file, re.DOTALL)
Zoey
It would be faster to just use Notepad++ or another editor with regex support. – Rahul Dec 10 '17 at 05:59

1 Answer

Your approach is not totally wrong, but it needs quite a bit more to achieve the desired goal. The following pattern captures the text between the titles in group 1:

(?:TITLE:\s.+?\s*SINGER:\s\w+?\s\r?\n?)(.+?)(?=\s+?TITLE:\s.+?\s*SINGER:\s\w+?\s+?|$)

[demo][1]
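To make the pattern easier to read, here is the same expression laid out with `re.VERBOSE` and a comment per part. This is only an annotated sketch of the pattern above; the name `LYRICS` and the short `sample` text are made up for illustration.

```python
import re

# Same pattern as above, split across lines. In VERBOSE mode, whitespace and
# "#" comments inside the pattern are ignored by the regex engine.
LYRICS = re.compile(r"""
    (?:TITLE:\s.+?\s*SINGER:\s\w+?\s\r?\n?)   # header line: matched but not captured
    (.+?)                                      # group 1: the lyrics, as short as possible
    (?=                                        # stop as soon as what follows is either
        \s+?TITLE:\s.+?\s*SINGER:\s\w+?\s+?    #   the next header
        | $                                    #   or the end of the text
    )
""", re.DOTALL | re.VERBOSE)

# Hypothetical two-song sample in the same layout as the question.
sample = (" TITLE: A  SINGER: X\n\n"
          " one\n"
          " two\n\n"
          " TITLE: B  SINGER: Y\n\n"
          " three")

for lyrics in LYRICS.findall(sample):
    print(repr(lyrics.strip()))
```

The lazy `(.+?)` together with the lookahead is what keeps each match from running into the next song: the lookahead checks for the next header without consuming it, so the following match can start there.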

import re

regex = r"(?:TITLE:\s.+?\s*SINGER:\s\w+?\s\r?\n?)(.+?)(?=\s+?TITLE:\s.+?\s*SINGER:\s\w+?\s+?|$)"

test_str = (" TITLE: Praying  SINGER: Kesha\n\n"
    " [Music video spoken intro:]\n"
    " \"Am I dead? Or is this one of those dreams? Those horrible dreams that seem \n"
    " like they last forever? If I am alive, why? Why? If there is a God or \n"
    " whatever, something, somewhere, why have I been abandoned by everyone and \n"
    " everything I've ever known? \n"
    " I've ever loved? Stranded. What is the lesson? \n"
    " What is the point? \n\n"
    " TITLE: Don't Stop the Party  SINGER: Pitbull\n\n"
    "  I say, y'all having a good time, I'll bet\n\n"
    "    Yeah, yeah, yeah\n"
    "    Que no pare la fiesta\n"
    "    Don't stop the party\n"
    "    Yeah, yeah, yeah\n"
    "    Que no pare la fiesta\n"
    "    Don't stop the party\n\n"
    " TITLE: Don't Stop the Code  SINGER: Ugly Kid George\n\n"
    "  Ding Dong\n\n"
    "    Yeah, yeah, yeah\n"
    "    Que no pare la fiesta\n"
    "    Don't stop the party\n"
    "    Yeah, yeah, yeah\n"
    "    Que no pare la fiesta\n"
    "    Don't stop the\n\n"
    "   Ding Dong")

# re.DOTALL lets "." match newlines, so one match can span a multi-line lyrics block.
matches = re.finditer(regex, test_str, re.DOTALL | re.UNICODE)

for match in matches:
    # Group 1 holds the lyrics; report each captured group with its span.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {group_num} found at {start}-{end}: {group}".format(
            group_num=group_num,
            start=match.start(group_num),
            end=match.end(group_num),
            group=match.group(group_num),
        ))
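One caveat: `SINGER:\s\w+?\s` consumes only the first word of the singer, so for a multi-word name such as "Ugly Kid George" the rest of the name ("Kid George") ends up at the start of the captured group. A possible alternative, sketched here under the assumption that every header sits on its own line and contains both `TITLE:` and `SINGER:`, is to split the text on whole header lines instead. The names `HEADER` and `extract_lyrics` are hypothetical.

```python
import re

# Assumption: each header occupies exactly one line, e.g.
# " TITLE: Praying  SINGER: Kesha". Without DOTALL, "." stops at the newline,
# so ".*$" swallows the whole singer name, however many words it has.
HEADER = re.compile(r"^[ \t]*TITLE:.*?SINGER:.*$", re.MULTILINE)

def extract_lyrics(text):
    """Return one lyrics block per TITLE/SINGER header line."""
    # The first piece is whatever precedes the first header, so drop it.
    chunks = HEADER.split(text)[1:]
    # strip() removes surrounding blank lines (and the first line's indent).
    return [chunk.strip() for chunk in chunks]

# Hypothetical sample in the same layout as the question.
sample = (" TITLE: Praying  SINGER: Kesha\n\n"
          " Am I dead?\n"
          " What is the point? \n\n"
          " TITLE: Don't Stop the Code  SINGER: Ugly Kid George\n\n"
          "  Ding Dong\n\n"
          "    Yeah, yeah, yeah\n")

for block in extract_lyrics(sample):
    print(repr(block))
```

For a real file, the text could be read with `open(path).read()` and passed to `extract_lyrics`. This sketch would split in the wrong place if a lyrics line itself contained both `TITLE:` and `SINGER:`.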


  [1]: https://regex101.com/r/g1BG5e/1
wp78de