How to extract the text between a word and its next occurrence?

Question

I have the following sample text:

mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\begin{itemize}
  \item Item 1
  \item Item 2
\end{itemize}
\end{document}'''

I'm trying to write a regular expression that extracts the following from mystr:

['This is introduction paragraph','This is non-introduction paragraph','    This is sample section paragraph\n \begin{itemize}\n\item Item 1\n\item Item 2\n\end{itemize}']

Your example does not illustrate the question. "quick elephant" does not have an occurrence of the word "a" after it. – roarsneer, Oct 28 '16 at 11:47
http://stackoverflow.com/questions/743806/split-string-into-a-list-in-python has a more detailed description, but the answer above is correct... – A. N. Other, Oct 28 '16 at 11:50
Good job on the edit; this helps people understand what you're after. Have you attempted anything so far to tackle this problem? It might help to post it if you have. – Dimitris Fasarakis Hilliard, Oct 28 '16 at 12:07

chapelo · Answer 1 · 2016-10-28T19:48:00.893

2

Perhaps you need to use regex for some reason, for example because the splitting string is more involved than just "a". The re module has a split function too:

import re
str_ = "a quick brown fox jumps over a lazy dog than a quick elephant"


print(re.split(r'\s?\ba\b\s?',str_))

# ['', 'quick brown fox jumps over', 'lazy dog than', 'quick elephant']
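If the splitting word is not known in advance, the same idea can be wrapped in a small helper. This is only a sketch: split_on_word is a hypothetical name, and it assumes the word starts and ends with a letter, digit or underscore, because \b only matches next to such characters.

```python
import re

def split_on_word(text, word):
    # Hypothetical helper, not part of the re module.
    # re.escape protects any regex metacharacters inside the word;
    # \b keeps the match from firing inside longer words such as "lama".
    pattern = r"\s*\b" + re.escape(word) + r"\b\s*"
    return re.split(pattern, text)

parts = split_on_word(
    "a quick brown fox jumps over a lazy dog than a quick elephant", "a"
)
print(parts)
# ['', 'quick brown fox jumps over', 'lazy dog than', 'quick elephant']
```

The leading empty string appears because the text itself starts with the word; drop it with parts[1:] if it is not wanted.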

EDIT: expanded answer with the new information you provided...

After your edit, which describes the problem better and includes text that looks like LaTeX, I think you need to extract the lines that do not start with a \, i.e. the lines that are not LaTeX commands. In other words, you need the lines containing only text. Try the following, still using regular expressions:

import re

mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\end{document}'''

pattern = r"^[^\\]*\n"


matches = re.findall(pattern, mystr, flags=re.M)

print(matches)

# ['This is introduction paragraph\n', 'This is non-introduction paragraph\n', 'This is sample section paragraph\n']
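If the goal is the whole body of each section, including the itemize block from the question, one possible sketch is a lazy match with a lookahead. This is not the pattern from the answer above, and it rests on assumptions: every \section{...} sits on its own line, section titles contain no nested braces, and the last section is closed by \end{document}.

```python
import re

mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\begin{itemize}
  \item Item 1
  \item Item 2
\end{itemize}
\end{document}'''

# Capture everything after a \section{...} line, lazily, up to (but not
# including) the newline before the next \section{ or \end{document}.
# re.DOTALL lets "." cross line breaks so multi-line bodies are captured.
pattern = re.compile(
    r"\\section\{[^}]*\}\n(.*?)(?=\n\\section\{|\n\\end\{document\})",
    re.DOTALL,
)

sections = pattern.findall(mystr)
for body in sections:
    print(repr(body))
```

Because the lookahead does not consume the next \section line, findall can start the following match right there, which is what yields the text between one occurrence and the next.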

edited Oct 28 '16 at 19:48

answered Oct 28 '16 at 11:55

chapelo


Thank you... But splitting won't solve my purpose. Sorry for not writing the question in full. – Rebhu Johymalyo Josh Oct 28 '16 at 12:00
@RebhuJohymalyoJosh: OK, just try to explain it better, perhaps with a real example and code of what you are trying to do. – chapelo Oct 28 '16 at 12:03
I have edited the question. Kindly have a look at it. – Rebhu Johymalyo Josh Oct 28 '16 at 12:06
@RebhuJohymalyoJosh: Check the edited answer. Hope it helps. – chapelo Oct 28 '16 at 20:08
Hey, sorry for the trouble, but that won't work either. I have updated the question again to give you a more detailed explanation. – Rebhu Johymalyo Josh Nov 01 '16 at 13:47

score 0 · Answer 2 · answered Oct 28 '16 at 11:54

You can use the split method from str:

my_string = "a quick brown fox jumps over a lazy dog than a quick elephant"
word = "a "
my_string.split(word)

Results in:

['', 'quick brown fox jumps over ', 'lazy dog than ', 'quick elephant']

Note: Don't use str as a variable name as it is a keyword in Python.
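A small sketch of why reusing the name is risky. Strictly speaking, str is a built-in type rather than a reserved keyword, so Python accepts the assignment, but the built-in is hidden for the rest of that scope. The function name shadowing_demo is made up for this illustration.

```python
def shadowing_demo():
    # Rebinding the name hides the built-in type inside this function.
    str = "a quick brown fox"
    try:
        # This now tries to call a string object, not the str type.
        str(42)
    except TypeError as exc:
        return type(exc).__name__
    return "no error"

result = shadowing_demo()
print(result)
# TypeError
```

Outside the function the built-in is untouched, because the rebinding was local to shadowing_demo.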

answered Oct 28 '16 at 11:54

José Sánchez


str is NOT a keyword in Python. It's just a built-in class. So technically there is no issue in using str as a name – Rebhu Johymalyo Josh Oct 28 '16 at 11:58
@RebhuJohymalyoJosh while you are right that it is not a keyword you are wrong in stating that there's no issue in using it. Using `str` as a name for your variable will mask the built-in `str` and might lead to unexpected issues in the long run. In short, *avoid using it*. – Dimitris Fasarakis Hilliard Oct 28 '16 at 12:03
@Jose Sanchez: Your solution gives strange results if the text contains a word ending in "a", for example "a quick brown fox jumps over a lazy lama than a quick elephant". You could split on " a " instead, after prepending a space: " " + my_string – am2 Oct 28 '16 at 12:21
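The pitfall described in the last comment can be reproduced with plain str.split. The sketch below compares the original delimiter with the padded variant from that comment; the variable names naive and padded are chosen only for this illustration.

```python
my_string = "a quick brown fox jumps over a lazy lama than a quick elephant"

# Splitting on "a " also cuts "lama" apart, because it ends in "a"
# and is followed by a space.
naive = my_string.split("a ")
print(naive)
# ['', 'quick brown fox jumps over ', 'lazy lam', 'than ', 'quick elephant']

# Prepending a space lets " a " match the standalone word only,
# including the one at the very start of the text.
padded = (" " + my_string).split(" a ")
print(padded)
# ['', 'quick brown fox jumps over', 'lazy lama than', 'quick elephant']
```

The padded variant still misses an "a" at the very end of the text or one followed by punctuation, so a word-boundary regex is the more general choice in those cases.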
