
I have a text in which some lines look like this:

20   luz de las remotísimas estrellas.

By "like this" I mean lines with a number on the left, which indicates the line number within the chapter. Other lines look like this:

es ya una distracción en esta ociosidad perdurable!           (P126)

These lines mark the start of a new page of the book.

Is there a simple way to remove those numbers and the parenthesized page markers from the lines? I have already used a regex to remove "[]" with numbers inside, but I don't completely understand it.

martineau
Oto
Maybe this can give you a hint for the regex: https://stackoverflow.com/questions/58208746/how-to-remove-parentheses-and-all-data-within-using-python3 Change it to handle the P instead of any text. [regex101](https://regex101.com/) is one of many pages where you can study and try out your regex. – akane Oct 15 '20 at 00:52

It would also be appreciated if you could show what you tried and provide a [minimal-reproducible-example](https://stackoverflow.com/help/minimal-reproducible-example). – akane Oct 15 '20 at 00:54

1 Answer


You can use regex groups.

I assume you want to remove the 20 from "20 luz de las remotísimas estrellas" and the (P126) from "es ya una distracción en esta ociosidad perdurable! (P126)".

For both cases you can use this function:

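import re

def clean_line(line):
    # Group 1: optional line number, group 2: the text, group 3: optional "(P<number>)".
    # The \s* parts keep the spaces around the line number out of group 2.
    regex = r"\s*(\d*)?\s*([^\(]*)(\(P\d+\))?"
    # Group 2 still carries the padding before a page marker, so strip it.
    return re.match(regex, line).group(2).strip()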

Here we are grouping. In a regex you can group parts of a match using parentheses, and this regex has three groups:

(\d*) captures a run of digits of any length (possibly none).

([^(]*) captures a string up to the first '('.

(\(P\d+\)) captures a string of the form '(Pnumber)', where number is any sequence of one or more digits.

The ? marks mean that the preceding group is optional.

Group 0 is the string matched by the whole regex, and we are interested in the second group, so we call group(2) on the match. Note that the text is cut at the first '(' even when that parenthesis is not a page marker.
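To see the three groups in practice, here is a small sketch that applies the pattern from the function above to the two sample lines from the question. The helper name `show_groups` is made up for this illustration. A group that matched nothing comes back either empty or as `None`.

```python
import re

# Same pattern as in the answer above.
PATTERN = r"(\d*)?([^\(]*)(\(P\d+\))?"

samples = [
    "20   luz de las remotísimas estrellas.",
    "es ya una distracción en esta ociosidad perdurable!           (P126)",
]

def show_groups(line):
    """Return the tuple (group 1, group 2, group 3) for one line."""
    return re.match(PATTERN, line).groups()

for sample in samples:
    # Prints the line number (if any), the text, and the page marker (if any).
    print(show_groups(sample))
```

In the first line the page-marker group is `None` because there is no "(P...)" at the end. In the second line it holds "(P126)", and group 2 holds the text plus the padding in front of the marker.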
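As an alternative, here is a minimal sketch that uses re.sub instead of groups, so that a line containing an ordinary '(' is not cut short. The names `strip_markers` and `clean_text` are hypothetical. It assumes that a number at the very start of a line is always a line number and that page markers always look like "(P" followed by digits at the end of a line.

```python
import re

# Leading line number plus the spaces after it (start of the line only).
LINE_NUMBER = re.compile(r"^\s*\d+\s+")
# Trailing page marker such as "(P126)" plus the padding before it.
PAGE_MARKER = re.compile(r"\s*\(P\d+\)\s*$")

def strip_markers(line):
    """Remove a leading line number and a trailing page marker, if present."""
    line = LINE_NUMBER.sub("", line)
    return PAGE_MARKER.sub("", line)

def clean_text(text):
    """Apply strip_markers to every line of a multi-line string."""
    return "\n".join(strip_markers(line) for line in text.splitlines())

print(strip_markers("20   luz de las remotísimas estrellas."))
print(strip_markers("es ya una distracción en esta ociosidad perdurable!           (P126)"))
```

The trade-off is that a line of real text that happens to begin with a number would lose that number, so this only fits if the assumption above holds for your book.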

Please let me know if this answer is useful.