
Edited to divert attention from the type of text data I'm using and redirect it to the actual question.

Further edited to note that:

  1. The linked question on the closing notice does not in any way help me with my issue, and in fact went as far as to further confuse me with an abundance of criteria and syntax
  2. I have found a solution to the problem herein and will share it as an answer upon this question reopening so that anyone else with this same issue will have at least a starting point, if not a resolution to this issue.

Situation:

I have written a program that uses the requests module to grab all the text from a website. I use the exact same code in another system that does work, so this piece is not the issue. I am trying to use re.findall() to grab data in the order it appears. In the system that works, the line I use is

paragraphs = re.findall(r'c1(.*?)c1', str(mytext))

where c1 stands in place of my first set of criteria. I then use a few lines to get rid of what I don't need.

What I've tried:

I've attempted the following pieces of code, and none have worked. The information I've been able to find sadly doesn't address my issue. We could theorise all day as to why a guide for this is scarce, but the fact is a few hours of Googling got me nowhere.

First attempt:

I tried simply keeping it in-line

re.findall(r'c1(.*?)c1c2(.*?)c2', str(mytext))

where c2 stands in place of my second set of criteria. Unfortunately this returns [], which is useless to me.

Second attempt:

I thought that maybe the way I did this was wrong, so I shuffled it around a bit

re.findall(r'c1(.*?)c1', r'c2(.*?)c2', str(mytext))

re.findall(r'c1(.*?)c1'r'c2(.*?)c2', str(mytext))

re.findall(r'c1(.*?)c1' or 'c2(.*?)c2', str(mytext))

re.findall(r'c1(.*?)c1' or r'c2(.*?)c2', str(mytext))

But the first two gave the same result as my initial attempt. The last two returned only the c1(.*?)c1 matches, which is useful data, but they don't contain the c2(.*?)c2 matches at all, let alone in the order they appear in the text.
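For reference, a minimal sketch of what Python does with two of these variants before re.findall() ever sees them. It assumes the example string from the Question section (spaces removed) as a stand-in for the real text, so the real data may behave differently; the variable names are illustrative only.

```python
import re

# Stand-in text: the example from the question with the spaces removed.
mytext = "c2helloc2c1worldc1c2!!!c2"

# `or` is evaluated by Python itself, and a non-empty string is truthy,
# so only the first pattern is ever handed to re.findall().
pattern_or = r'c1(.*?)c1' or r'c2(.*?)c2'

# Adjacent string literals are joined into one pattern, which is the
# same pattern as in the first attempt.
pattern_adjacent = r'c1(.*?)c1' r'c2(.*?)c2'

result_or = re.findall(pattern_or, mytext)
result_adjacent = re.findall(pattern_adjacent, mytext)

print(pattern_or)        # c1(.*?)c1
print(pattern_adjacent)  # c1(.*?)c1c2(.*?)c2
print(result_or)         # ['world']
print(result_adjacent)   # [('world', '!!!')]
```

The joined pattern only matches a c1 block that is immediately followed by a c2 block, so it misses 'hello' here and would return [] for any text where the two blocks never sit back to back in that order.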

Third attempt:

Don't run this code: it crashed my laptop with an infinite loop. I had done some research by this point and discovered the re.search() function.

paragraphs = []
ticker = ''
while ticker != 'None':
    ticker = re.search(r'c1(.*?)c1', str(mytext))
    if (ticker == 'None'):
        ticker = re.search(r'c2(.*?)c2', str(mytext))
    if (ticker != 'None'):
        paragraphs.append(ticker)
print(paragraphs)

Clearly, this was a dumb idea. It tried to fill the paragraphs list with an infinite number of copies of the first c1(.*?)c1 match.
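A sketch of how a loop along these lines could terminate, under two assumptions: the search has to resume after the previous match instead of restarting from the beginning of the text, and "no match" is detected with `is None`, since re.search() returns the None object when nothing is found. The compiled-pattern names and the sample text are illustrative, not taken from the real program.

```python
import re

# Stand-in text: the example from the question with the spaces removed.
mytext = "c2helloc2c1worldc1c2!!!c2"

c1_pattern = re.compile(r'c1(.*?)c1')
c2_pattern = re.compile(r'c2(.*?)c2')

paragraphs = []
pos = 0  # position in mytext where the next search starts
while True:
    # Search for both criteria from the current position onwards.
    found = [m for m in (c1_pattern.search(mytext, pos),
                         c2_pattern.search(mytext, pos))
             if m is not None]
    if not found:
        break  # neither pattern matches any more, so stop
    # Keep whichever match starts first, then continue after it.
    first = min(found, key=lambda m: m.start())
    paragraphs.append(first.group(1))
    pos = first.end()

print(paragraphs)  # ['hello', 'world', '!!!']
```

Because every match consumes at least its two delimiters, pos always moves forward and the loop ends once the text is exhausted.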

Question:

How, if at all, do I use re.findall() to create a list paragraphs that will go through the text in mytext and pick out everything that meets the criteria c1(.*?)c1 and c2(.*?)c2 and place them in the order they appear?

E.g. if the text is (spaces added for clarity; they will not exist in the file)

c2 hello c2 c1 world c1 c2 !!! c2

The program will be

#get the text
#do the re.findall() function and assign to the list paragraphs
print(paragraphs)

And will return

>>>['hello', 'world', '!!!']
WCJ277
  • May I ask why you're not using an HTML parser like beautifulsoup? [regex is not meant to parse HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Chase Jul 10 '20 at 13:32
  • This is the kind of problem I'm always tempted to work on, but I know I'll get downvoted if I actually offer an RE-based answer. I like REs too, but I have to agree with the previous comment, it's better to use an HTML parser... It'll save you a lot of gotchas in the long run. – joanis Jul 10 '20 at 13:54
  • @Chase I much prefer regex to beautifulsoup even though I'm more familiar with bs4 than regex. I'm making sure to copy the content of the HTML to a python variable so that I'm not directly parsing the HTML file itself. – WCJ277 Jul 10 '20 at 14:17
  • @joanis if you know a solution in regex I would much appreciate it, even if we have to go off-site to avoid downvotes – WCJ277 Jul 10 '20 at 14:19
  • Looks like you need `[x.group() for x in re.finditer(r'(c1|c2)(.*?)\1', mytext)]` or `[x.group() for x in re.finditer(r'(c1|c2)((?:(?!c1|c2).)*?)\1', mytext, flags=re.S)]` – Wiktor Stribiżew Jul 10 '20 at 14:51
  • also just to allude to the snippet that "crashes computers". You're comparing `ticker` to `'None'`. In python, `None` does not equal `'None'` - one is a string literal and the other....well it's a keyword for `NoneType`. So that loop is literally infinite and it is so for a rather obvious reason. – Chase Jul 10 '20 at 14:59
  • @WiktorStribiżew fine answer! Remember to do `.group(2)` to achieve OP's desired output. Note should be taken when using `re.S` though, `DOTALL` is dangerous....and also one of the reasons html shouldn't be parsed with regex. I hope OP's texts within `c1/c2` tags don't contain newlines so they don't have to use `DOTALL` – Chase Jul 10 '20 at 15:06
  • @WCJ277 Good idea to rewrite the question like you did, now at least people are really reading your question. I think the solution by Wiktor will work for you. The magic you need is there: `(c1|c2)` to match either criterion, and then `\1` to match it again where each instance ends. If the opening `c2` and closing `c2` are not actually verbatim equal in your real problem, you might need something a bit fancier, but the base logic should still work. – joanis Jul 10 '20 at 15:12
  • Check [Python demo](https://ideone.com/M8zF2D) and a [regex demo](https://regex101.com/r/RWs2GW/1). – Wiktor Stribiżew Jul 10 '20 at 15:57
  • @Chase oh no, I'm not comparing to the Python None, there are actual strings which fit my criteria that read "None". I should've clarified that in the post, my bad – WCJ277 Jul 12 '20 at 11:17
  • @joanis it would be great if moderators didn't close it. The linked question definitely isn't the same as mine, and the answers only serve to make me more confused about regex as it's full of syntax that I guess most people have committed to memory. – WCJ277 Jul 12 '20 at 13:17
  • Just voted to reopen, because I think it's an interesting and worthwhile question, sufficiently different from the duplicate link. Do Wiktor's Python and regex demos in his last comment above solve your problem? – joanis Jul 12 '20 at 13:46
  • This was not closed by a moderator. The duplicate shows exactly how to use a single regex to check a string for multiple conditions; if that's what you want, you need lookaheads. Of course, you can always `if re.search(r'c1', str) and re.search(r'c2', str):` but that doesn't seem to be what you are asking. – tripleee Jul 12 '20 at 13:50
  • Please provide feedback on the solution posted. – Wiktor Stribiżew Jul 19 '20 at 09:36

2 Answers


You may use

[x.group(2) for x in re.finditer(r'(c1|c2)(.*?)\1', mytext, flags=re.S)]

See the regex demo. Or, to match the shortest substrings:

[x.group(2) for x in re.finditer(r'(c1|c2)((?:(?!c1|c2).)*?)\1', mytext, flags=re.S)]

The regex matches

  • (c1|c2) - Group 1: c1 or c2
  • (.*?) - Group 2: any 0 or more chars as few as possible
  • \1 - the same value as in Group 1.

The for x in re.finditer(r'(c1|c2)(.*?)\1', mytext) iterates over all matches and x.group(2) will return Group 2 values only.
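As a quick check, both expressions can be run against the example string from the question (with the clarifying spaces removed). The variable names are illustrative, and the second input, nested, is a made-up string chosen only to show where the two patterns differ.

```python
import re

# Example text from the question, without the clarifying spaces.
mytext = "c2helloc2c1worldc1c2!!!c2"

plain = r'(c1|c2)(.*?)\1'
tempered = r'(c1|c2)((?:(?!c1|c2).)*?)\1'

paragraphs = [x.group(2) for x in re.finditer(plain, mytext, flags=re.S)]
print(paragraphs)  # ['hello', 'world', '!!!']

# re.findall() works too: with two groups it returns one tuple per match,
# (delimiter, content), so keep only the second element.
via_findall = [content for _, content in re.findall(plain, mytext, flags=re.S)]
print(via_findall)  # ['hello', 'world', '!!!']

# Hypothetical input where a c2 block sits inside a c1 block.
nested = "c1ac2bc2c1"
outer = [x.group(2) for x in re.finditer(plain, nested, flags=re.S)]
inner = [x.group(2) for x in re.finditer(tempered, nested, flags=re.S)]
print(outer)  # ['ac2bc2']
print(inner)  # ['b']
```

The tempered pattern refuses to step over another c1 or c2 inside the content, which is why it returns only the innermost block for the nested input.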

Wiktor Stribiżew

Try to put "OR" in between for multiple conditions:

re.findall(r'c1(.*?)c2', mytext) or re.findall(r'c2(.*?)c3', mytext)
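One caveat worth checking before relying on this: `or` between two lists evaluates to the first list if it is non-empty and to the second one otherwise, so the results are not merged. A small sketch, assuming the question's c1...c1 and c2...c2 delimiters and its example text in place of the patterns above:

```python
import re

# Example text from the question, without the clarifying spaces.
mytext = "c2helloc2c1worldc1c2!!!c2"

first = re.findall(r'c1(.*?)c1', mytext)
second = re.findall(r'c2(.*?)c2', mytext)

# `or` picks one list; it never combines the two.
combined = first or second

print(first)     # ['world']
print(second)    # ['hello', '!!!']
print(combined)  # ['world']
```

Here the c2 matches are dropped entirely because the c1 list is non-empty, so the order of appearance in the text cannot be recovered this way.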
Keshav