I'd like to write a function that strips the header from Project Gutenberg texts using a regular expression.
So far I have done the following (not as a function), which worked well (dracula is a string containing the text of the novel Dracula from Project Gutenberg):
import re
pattern = r'START OF THIS PROJECT GUTENBERG EBOOK (.)+?\*\*\*'
draculalist = re.split(pattern, dracula, maxsplit=1)
draculalist.pop(0)
dracula = ''.join(draculalist)
print dracula[:100]
That worked well, as did this:
headend = re.search(pattern, dracula).end()
dracula = dracula[headend:]
Then I tried to write a function:
def head_removal(text):
    """Remove the metadata in the header of Project Gutenberg texts."""
    headend = re.search(pattern, text).end()
    text = text[headend:]
The problem is that when I use the function with a certain text like
head_removal(dracula)
it does not change the string dracula, since strings are immutable of course. The header-free text only ends up in the name text inside the function. So I tried the other approach, which splits the string into a list and then joins it again:
pattern = r'START OF THIS PROJECT GUTENBERG EBOOK (.)+?\*\*\*'
def head_removal2(texts):
    """Remove the metadata in the header of Project Gutenberg texts."""
    liste = re.split(pattern, texts, maxsplit=1)
    liste.pop(0)
    texts = ''.join(liste)
This doesn't work either when I run:
head_removal2(dracula)
print dracula[:100]
Any idea how to write that function?
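For reference, both attempts rebind a local name inside the function and return nothing, so the caller's string never changes. One possible way around this, as a minimal sketch rather than a definitive implementation, is to return the new string and reassign it at the call site. The names HEADER_PATTERN and sample are made up for illustration, and returning the text unchanged when no marker is found is an assumption, not something required by the original code.

```python
import re

# Same pattern as above, compiled once (HEADER_PATTERN is a hypothetical name).
HEADER_PATTERN = re.compile(r'START OF THIS PROJECT GUTENBERG EBOOK (.)+?\*\*\*')


def head_removal(text):
    """Return text with the Project Gutenberg header removed."""
    match = HEADER_PATTERN.search(text)
    if match is None:
        # Assumption: texts without the marker are returned unchanged.
        return text
    # Slicing builds a new string; the argument itself is never modified.
    return text[match.end():]


# Hypothetical stand-in for the full novel text.
sample = "Some metadata\n*** START OF THIS PROJECT GUTENBERG EBOOK DRACULA ***\nCHAPTER I"

# The result has to be assigned back, because strings are immutable.
sample = head_removal(sample)
```

With the real text this would be called as dracula = head_removal(dracula), after which dracula[:100] should start right after the marker line.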