I'm loading a data file that uses special formatting, with ** separating the file into sections. For example, **HEADER, **COMMENTS, **CONSTANTS, and **DATA are all section titles within the file, and each section needs to be handled differently. So I'm trying to index the location of each section title, all of which start with double asterisks.

I currently have:

Titles = [m.start() for m in re.finditer('e', mytxt)]

Which indexes the location of every e in the file. However:

Titles = [m.start() for m in re.finditer('**', mytxt)]

gives me:

error: nothing to repeat

I also tried:

Titles = [m.start() for m in re.finditer(r'**', mytxt)]

thinking it would turn the search term into raw text and stop trying to handle * as a special character but it didn't work.

Pranav Hosangadi
  • Welcome to Stack Overflow! Please take the [tour], read [what's on-topic here](/help/on-topic), [ask], and the [question checklist](//meta.stackoverflow.com/q/260648/843953), and provide a [mre] that people can paste into their environments and run as-is to reproduce your error. In this case, some example text would be nice. – Pranav Hosangadi Apr 20 '21 at 15:39

1 Answer

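The problem is that you need to escape those asterisks with backslashes. Putting them in a raw string doesn't help by itself: a raw string only stops Python from interpreting backslashes, and * is still a regex quantifier, so ** leaves the regex engine with nothing to repeat.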

import re

mytxt = """**HEADER
abc
def
**COMMENTS
ghi
jkl
**DATA
123
546
789"""

titles = [m.start() for m in re.finditer(r'\*\*', mytxt)]
print(titles) # gives [0, 17, 36]
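
As an alternative to the hand-written backslashes above, re.escape can build the escaped pattern for you. The sketch below also checks that r'**' and '**' are the same string, which is why the raw prefix alone changes nothing. The sample text and variable names are made up for illustration.

```python
import re

# With no backslash in the literal, the raw prefix has nothing to change.
same_string = (r'**' == '**')

# re.escape adds a backslash before each regex metacharacter.
pattern = re.escape('**')

sample = "**HEADER\nabc\n**DATA\n123"  # hypothetical sample text
positions = [m.start() for m in re.finditer(pattern, sample)]

print(same_string)  # True
print(pattern)      # \*\*
print(positions)    # [0, 13]
```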

If you want to ensure that these are at the start of a line, or that the thing after the asterisks should be one of a few keywords, you could add that to the regex:

mytxt = """**HEADER
abc
def
**COMMENTS
ghi
jkl
**DATA
**junkheader
123**000
546**123
789"""

titles = [m.start() for m in re.finditer(r'\*\*', mytxt)]
print(titles) # gives [0, 17, 36, 43, 59, 68]

# But anchoring the pattern to the start of each line skips the mid-line matches:
titles = [m.start() for m in re.finditer(r'^\*\*', mytxt, re.MULTILINE)]
print(titles) # gives [0, 17, 36, 43]

# If you know valid titles beforehand
known_titles = ["HEADER", "COMMENTS", "CONSTANTS", "DATA"]
regex = r"^\*\*(" + "|".join(known_titles) + ")$"
print(regex) # Output: ^\*\*(HEADER|COMMENTS|CONSTANTS|DATA)$

titles = [m.start() 
     for m in re.finditer(regex, mytxt, re.MULTILINE)]
print(titles) # Gives [0, 17, 36]

re.MULTILINE makes ^ and $ match at the start and end of every line in the string, not only at the start and end of the whole string.

The ^ in ^\*\* at the start of the regex forces the asterisks to be at the start of a new line.
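
To see what the flag changes, here is a minimal sketch that runs the same anchored pattern with and without re.MULTILINE. The sample text is a shortened, made-up version of the one above.

```python
import re

sample = "**HEADER\nabc\n**DATA\n123"  # hypothetical sample text

# Without the flag, ^ only matches at the very start of the string.
without_flag = [m.start() for m in re.finditer(r'^\*\*', sample)]

# With the flag, ^ also matches right after each newline.
with_flag = [m.start() for m in re.finditer(r'^\*\*', sample, re.MULTILINE)]

print(without_flag)  # [0]
print(with_flag)     # [0, 13]
```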

So this regex: ^\*\*(HEADER|COMMENTS|CONSTANTS|DATA)$ means:

  • ^: Match at the beginning of the line
  • \*\*: Literally two asterisks
  • (HEADER|COMMENTS|CONSTANTS|DATA): One of those words
  • $: End of the line

Try the regex at Regex101
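
If the titles are not known in advance, one possible variation is to capture whatever follows the asterisks instead of listing keywords. This is only a sketch: it assumes a title is everything after ** up to the end of the line and that lines end in \n. The names title_pattern and sections, and the sample text, are hypothetical.

```python
import re

# Hypothetical sample text with one mid-line ** that should be ignored.
sample = "**HEADER\nabc\n**COMMENTS\nghi\n**DATA\n123**000"

# ^ anchors to the start of a line; (.+) captures the rest of that line as the title.
title_pattern = re.compile(r'^\*\*(.+)$', re.MULTILINE)

# Pair each title's start index with its name.
sections = [(m.start(), m.group(1)) for m in title_pattern.finditer(sample)]

print(sections)  # [(0, 'HEADER'), (13, 'COMMENTS'), (28, 'DATA')]
```

Keeping the name next to the index makes it easy to dispatch each section to its own handler later.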

Pranav Hosangadi
  • WOW that worked, thank you! Could you kindly explain why it works? I get that the r indicates raw text, but shouldn't that then search for \*\* ? What does the \ mean? – GregorySmithUK Apr 20 '21 at 15:46
  • The example to ensure the ** are at the start won't work for me, as it requires that I already know the header titles. Those are examples, but there are many more acceptable titles, and users can add their own. – GregorySmithUK Apr 20 '21 at 15:52
  • @GregorySmithUK Because `*` has [special meaning](https://www.rexegg.com/regex-quickstart.html#quantifiers) in regex. To tell the regex engine that you want to match a literal `*`, you need to _escape_ that `*`. You do that with a backslash. Raw strings are a _Python_ feature that stops Python from treating backslashes in string literals as escape sequences; they have nothing to do with regex or with escaping other characters – Pranav Hosangadi Apr 20 '21 at 15:53
  • @GregorySmithUK I modified my example for the start of the line and added some explanation. You don't need to know the titles to match them at the start of the line – Pranav Hosangadi Apr 20 '21 at 15:59