
I'd like to write a function that removes the head of Project Gutenberg texts using RegEx.

So far I have done the following (not as a function), which worked well (dracula is a string containing the text of the Dracula novel from Project Gutenberg):

pattern = r'START OF THIS PROJECT GUTENBERG EBOOK (.)+?\*\*\*'

draculalist = re.split(pattern, dracula, maxsplit=1)
draculalist.pop(0)
dracula = ''.join(draculalist)
print dracula[:100]

This worked well, as did:

headend = re.search(pattern, dracula).end()
dracula = dracula[headend:]

Then I tried to write a function:

def head_removal(text):
    """Entfernung der Meta-Daten im Kopf der Projekt Gutenberg Texte"""
    headend = re.search(pattern, text).end()
    text = text[headend:]

The problem is that when I use the function with a certain text like

head_removal(dracula)

it does not change the string dracula, which is of course immutable. The header-free text only exists as the local object text inside the function. So I tried the other code, which splits the string into a list and then joins it again:

pattern = r'START OF THIS PROJECT GUTENBERG EBOOK (.)+?\*\*\*'

def head_removal2(texts):
    """Entfernung der Meta-Daten im Kopf der Projekt Gutenberg Texte"""
    liste = re.split(pattern, texts, maxsplit=1)
    liste.pop(0)
    texts = ''.join(liste)

This doesn't work either: after calling head_removal2(dracula), print dracula[:100] still shows the original text.

Any idea how to write that function?

Brian Tompsett - 汤莱恩
Fadinha

1 Answer

Python passes object references by value (sometimes called "pass by assignment"), so rebinding the parameter name inside a function does not change what the caller's variable refers to. You have to return the result from the function and assign it back to the original name.

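import re

def modify_test(dracula):
    pattern = r'START OF THIS PROJECT GUTENBERG EBOOK (.)+?\*\*\*'
    # Split once at the header marker and drop everything before it.
    draculalist = re.split(pattern, dracula, maxsplit=1)
    draculalist.pop(0)
    dracula = ''.join(draculalist)
    # Return the whole remaining text, not just a 100-character preview.
    return dracula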

# call it and re-assign:
dracula = modify_test(dracula)
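
The same return-and-reassign idea also works with the re.search variant from the question. The following is only a sketch: the name HEADER_PATTERN, the sample string, and the choice to return the text unchanged when no marker is found are assumptions for illustration, not part of the original code. Using search also sidesteps a detail of re.split, which includes the text of any capturing group (here the `(.)`) in the list it returns.

```python
import re

# Same header pattern as in the question, compiled once (HEADER_PATTERN is an illustrative name).
HEADER_PATTERN = re.compile(r'START OF THIS PROJECT GUTENBERG EBOOK (.)+?\*\*\*')

def head_removal(text):
    """Return text with everything up to and including the header marker removed."""
    match = HEADER_PATTERN.search(text)
    if match is None:
        # Assumption: if no marker is found, hand the text back unchanged.
        return text
    return text[match.end():]

# Hypothetical sample input standing in for the real novel text.
sample = "Title: Dracula\n*** START OF THIS PROJECT GUTENBERG EBOOK DRACULA ***\nCHAPTER I"

# The function cannot rebind the caller's name, so assign the result back.
sample = head_removal(sample)
```

Checking for a missing match avoids the AttributeError that `re.search(...).end()` raises when the marker is absent.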
Nir Alfasi