1

I am wondering if there is a better approach than what I am currently taking to parse this file. I have a string that is in the general format of:

[Chunk of text]
--------------------
[Another chunk of text]

(There can be multiple chunks of text with the same separator between them)

I am trying to parse the chunks of text into elements of a list, which I can do with data.split('-'*20) [in this case]. However, if there are not exactly 20 hyphens, the split will not work as intended. I have been experimenting with regex, but I am currently unsure of a proper pattern to use.
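To make the failure mode concrete, here is a small sketch of how a fixed 20-hyphen separator behaves when the run is longer or shorter than expected. The sample strings are made up for illustration and are not taken from the real file.

```python
# Hypothetical sample strings; only the separator length differs.
exact = "first\n" + "-" * 20 + "\nsecond"
longer = "first\n" + "-" * 25 + "\nsecond"
shorter = "first\n" + "-" * 15 + "\nsecond"

separator = "-" * 20

print(exact.split(separator))    # two chunks, as intended
print(longer.split(separator))   # five stray hyphens stay attached to the second chunk
print(shorter.split(separator))  # no split at all: a single element
```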

Are there any better methods that I should use in this situation, or is there a regex I should use instead of the .split() method?

Mark N

2 Answers

1

I would try re.split() with the regex --+, which means:

  1. - - one hyphen
  2. -+ - one or more hyphens

This way it will not match a single hyphen, only a run of two or more. Alternatively, you could use -{2,}, which also means two or more.
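A minimal sketch of this approach in Python, assuming the text is held in a string named data as in the question. The sample content is invented, and the strip() call is an extra step (not part of the regex itself) that trims the newlines left around each chunk.

```python
import re

# Hypothetical sample input; the separators deliberately differ in length.
data = "first chunk\n--------------------\nsecond chunk\n-----\nthird chunk"

# Split on any run of two or more hyphens, then trim the leftover newlines.
chunks = [part.strip() for part in re.split(r"-{2,}", data)]
print(chunks)
```

Because re.split() consumes the matched hyphens, only the text between the separators ends up in the list.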

Mark N
m.cekiera
  • As far as I know, this will capture the hyphens themselves, resulting in the text being ignored (which is not the intended result). – Mark N Jul 13 '15 at 15:41
  • @MarkN But did you use split() with that? I thought that was what you were asking about: a regex to split the text – m.cekiera Jul 13 '15 at 15:42
  • I see..I understand now, thank you. You may wish to add this to your answer? [About the combination of regex and split] – Mark N Jul 13 '15 at 15:43
  • You may want to use `^--+$` (or `^-{2,}$`) so that it splits only on a line containing only hyphens and doesn't split on two or more hyphens within the text. – Miles Budnek Jul 13 '15 at 15:45
  • Would it become more complex to try and handle cases with different numbers of hyphens per separator, or could this be handled as well with this combination, because .split() would not cooperate as wanted? (Not required) – Mark N Jul 13 '15 at 15:47
  • @MarkN This regex (as well as the ones from the comments) will match any run of more than one hyphen. You can test it [here](https://regex101.com/r/qT0oY1/1); however, pay attention to the fact that the example uses the multiline and global match modes – m.cekiera Jul 13 '15 at 15:51
  • @MarkN I don't exactly understand what you mean... this regex will split the text wherever there is more than one hyphen in a row: 2, 4, 10, 5689 or more. As long as they are in a row ('------'), they are treated as a single match. I encourage you to paste your example text into the regex101 website and try this regex; you will see exactly how it works – m.cekiera Jul 13 '15 at 16:04
  • @m.cekiera I was mistaking your reference to split() for str.split() rather than re.split(). – Mark N Jul 13 '15 at 16:12
  • @MarkN Sorry again, I was concentrating on the regex, as I don't know Python too well. Thanks for your edit on my post – m.cekiera Jul 13 '15 at 16:16
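The anchored variant suggested in the comments can be sketched as follows. This assumes Python's re.MULTILINE flag, so that ^ and $ match at every line boundary rather than only at the ends of the whole string. The function name and sample text are hypothetical.

```python
import re

# Matches only lines that consist entirely of two or more hyphens.
RULE_LINE = re.compile(r"^-{2,}$", re.MULTILINE)


def split_on_rule_lines(text):
    """Split text on hyphen-only lines, trimming whitespace around each chunk."""
    return [chunk.strip() for chunk in RULE_LINE.split(text)]


# Hypothetical sample: the first chunk contains "--" inside a sentence.
text = "alpha -- still alpha\n----------\nbeta\n---\ngamma"

anchored = split_on_rule_lines(text)
unanchored = [chunk.strip() for chunk in re.split(r"--+", text)]

print(anchored)    # the inline "--" is left alone
print(unanchored)  # the inline "--" causes an extra, unwanted split
```

The trade-off is that a separator sharing its line with any other character, including trailing spaces, is no longer recognised.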
1

You want a regex split. I'm not Python-literate, but I found the function in the official 2.7.10 documentation and adapted it to your case:

>>> re.split(r'\n-{4,}\n', data)
  • 4 is the minimum number of hyphens you want to match.
  • \n matches the newlines before and after the separator. You probably don't want those in your text.
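A short runnable sketch of this pattern, written as a raw string and applied to invented sample text. It assumes Unix-style \n line endings; files with \r\n endings would need the pattern adjusted.

```python
import re

# Hypothetical sample: a two-hyphen line inside the first chunk is too short
# to count as a separator, while the 20- and 6-hyphen lines both qualify.
data = "intro\n--\nstill intro\n--------------------\nsecond chunk\n------\nthird chunk"

# The surrounding newlines are part of the match, so they are removed
# from the chunks along with the hyphens.
chunks = re.split(r"\n-{4,}\n", data)
print(chunks)
```

Since the newlines are consumed by the match, no extra stripping is needed, but a separator on the very first or very last line of the string would not be matched.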
Rahmi Aksu