
I'm writing a program to split the words contained in a hashtag.

For example I want to split the hashtags:

#Whatthehello #goback

into:

What the hello go back

I'm having trouble using re.sub with a function as the replacement argument.

The code I've written is:

import re,pdb

def func_replace(each_func):
    i=0
    wordsineach_func=[] 
    while len(each_func) >0:
        i=i+1
        word_found=longest_word(each_func)
        if len(word_found)>0:
            wordsineach_func.append(word_found)
            each_func=each_func.replace(word_found,"")
    return ' '.join(wordsineach_func)

def longest_word(phrase):
    phrase_length=len(phrase)
    words_found=[];index=0
    outerstring=""
    while index < phrase_length:
        outerstring=outerstring+phrase[index]
        index=index+1
        if outerstring in words or outerstring.lower() in words:
            words_found.append(outerstring)
    if len(words_found) ==0:
        words_found.append(phrase)
    return max(words_found, key=len)        

words=[]
# The file corncob_lowercase.txt contains a list of dictionary words
with open('corncob_lowercase.txt') as f:
    read_words=f.readlines()

for read_word in read_words:
    words.append(read_word.replace("\n","").replace("\r",""))

For example when using these functions like this:

s="#Whatthehello #goback"

#checking if the function is able to segment words
hashtags=re.findall(r"#(\w+)", s)
print func_replace(hashtags[0])

# using the function for re.sub
print re.sub(r"#(\w+)", lambda m: func_replace(m.group()), s)

The output I obtain is:

What the hello
#Whatthehello #goback

Which is not the output I had expected:

What the hello
What the hello go back

Why is this happening? In particular, I've used the suggestion from this answer, but I don't understand what goes wrong in this code.

Abhishek Bhatia
  • Hmmm.. what is the problem. Why the downvote? It is about programming!! – Abhishek Bhatia Feb 10 '16 at 16:37
  • It's good to be concise, but your question should at least be readable. Use English sentences, not summaries like "aim: do this. Code: ..; output ..; why? see here". – Bakuriu Feb 10 '16 at 19:39
  • @Bakuriu Thanks for the edit! I will keep that in mind when asking again. – Abhishek Bhatia Feb 10 '16 at 19:54
  • I just wanted to give an example of how to write a good question. You did a good job of providing the complete code with its output and what you expected, but you should include at least a paragraph of text describing what you want to do (maybe why, with a little background) and how the code fits into this. That way your question will attract more attention and be more useful. – Bakuriu Feb 10 '16 at 19:59

1 Answer


Notice that m.group() returns the entire string that matched, whether or not it was part of a capturing group:

In [19]: m = re.search(r"#(\w+)", s)

In [20]: m.group()
Out[20]: '#Whatthehello'

m.group(0) also returns the entire match:

In [23]: m.group(0)
Out[23]: '#Whatthehello'

In contrast, m.groups() returns all capturing groups:

In [21]: m.groups()
Out[21]: ('Whatthehello',)

and m.group(1) returns the first capturing group:

In [22]: m.group(1)
Out[22]: 'Whatthehello'
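
The same distinction applies inside a replacement function passed to re.sub, which receives a match object for each match. Here is a minimal sketch that uses str.upper as a stand-in for func_replace; the helper names whole_match and first_group are made up for illustration and are not from the question:

```python
import re

s = "#Whatthehello #goback"
pattern = r"#(\w+)"

def whole_match(m):
    # m.group() is the same as m.group(0): the full match, '#' included
    return m.group().upper()

def first_group(m):
    # m.group(1) is only the text captured by (\w+), without the '#'
    return m.group(1).upper()

print(re.sub(pattern, whole_match, s))  # the '#' characters are kept
print(re.sub(pattern, first_group, s))  # the '#' characters are dropped
```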

So the problem in your code originates with the use of m.group in

re.sub(r"#(\w+)", lambda m: func_replace(m.group()), s)

since

In [7]: re.search(r"#(\w+)", s).group()
Out[7]: '#Whatthehello'

whereas if you had used .group(1), you would have gotten

In [24]: re.search(r"#(\w+)", s).group(1)
Out[24]: 'Whatthehello'

and the preceding # makes all the difference:

In [25]: func_replace('#Whatthehello')
Out[25]: '#Whatthehello'

In [26]: func_replace('Whatthehello')
Out[26]: 'What the hello'
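
The reason is that longest_word only ever tests prefixes of its argument. With a leading #, every prefix starts with # and none is a dictionary word, so the function falls back to returning the whole phrase. A condensed sketch of that logic, assuming a tiny stand-in word set instead of the real dictionary file:

```python
# Tiny stand-in for the dictionary file (an assumption for illustration only)
words = {"what", "the", "hello"}

def longest_word(phrase):
    # Collect every prefix of phrase that is a dictionary word
    prefixes = [phrase[:i] for i in range(1, len(phrase) + 1)]
    found = [p for p in prefixes if p.lower() in words]
    # Same fallback as the original: no prefix matched, so return the whole phrase
    return max(found, key=len) if found else phrase

print(longest_word("Whatthehello"))   # What
print(longest_word("#Whatthehello"))  # #Whatthehello
```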

Thus, changing m.group() to m.group(1), and substituting /usr/share/dict/words for corncob_lowercase.txt,

import re

def func_replace(each_func):
    i = 0
    wordsineach_func = []
    while len(each_func) > 0:
        i = i + 1
        word_found = longest_word(each_func)
        if len(word_found) > 0:
            wordsineach_func.append(word_found)
            each_func = each_func.replace(word_found, "")
    return ' '.join(wordsineach_func)


def longest_word(phrase):
    phrase_length = len(phrase)
    words_found = []
    index = 0
    outerstring = ""
    while index < phrase_length:
        outerstring = outerstring + phrase[index]
        index = index + 1
        if outerstring in words or outerstring.lower() in words:
            words_found.append(outerstring)
    if len(words_found) == 0:
        words_found.append(phrase)
    return max(words_found, key=len)

words = []
# /usr/share/dict/words stands in for corncob_lowercase.txt as the list of dictionary words
with open('/usr/share/dict/words', 'rb') as f:
    for read_word in f:
        words.append(read_word.strip())
s = "#Whatthehello #goback"
hashtags = re.findall(r"#(\w+)", s)
print func_replace(hashtags[0])
print re.sub(r"#(\w+)", lambda m: func_replace(m.group(1)), s)

prints

What the hello
What the hello gob a c k

since, alas, 'gob' is longer than 'go'.
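
Greedy longest-prefix matching cannot recover once it has committed to 'gob'. One possible alternative, which is not part of the original code, is to try each dictionary prefix in turn and backtrack when the remainder cannot be split. This is only a sketch: segment is a hypothetical name, and the small word set is an assumption standing in for the real word list.

```python
# Small stand-in dictionary (an assumption; the real word list is far larger)
words = {"go", "gob", "back", "a", "what", "the", "hello"}

def segment(text, memo=None):
    """Return a list of dictionary words that concatenate to text, or None."""
    if memo is None:
        memo = {}
    if text == "":
        return []
    if text in memo:
        return memo[text]
    result = None
    # Try longer prefixes first, but fall back to shorter ones on failure
    for end in range(len(text), 0, -1):
        prefix = text[:end]
        if prefix.lower() in words:
            rest = segment(text[end:], memo)
            if rest is not None:
                result = [prefix] + rest
                break
    memo[text] = result
    return result

print(" ".join(segment("goback")))        # go back
print(" ".join(segment("Whatthehello")))  # What the hello
```

With a real word list that contains single letters, as the 'a c k' in the output above suggests /usr/share/dict/words does, several splits may be valid, so the first one found is not necessarily the intended one.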


One way you could have debugged this is to replace the lambda function with a regular function and then add print statements:

def foo(m):
    result = func_replace(m.group())
    print(m.group(), result)
    return result

In [35]: re.sub(r"#(\w+)", foo, s)
('#Whatthehello', '#Whatthehello')   <-- This shows you what `m.group()` and `func_replace(m.group())` return
('#goback', '#goback')
Out[35]: '#Whatthehello #goback'

That would focus your attention on

In [25]: func_replace('#Whatthehello')
Out[25]: '#Whatthehello'

which you could then compare with

In [26]: func_replace(hashtags[0])
Out[26]: 'What the hello'

In [27]: func_replace('Whatthehello')
Out[27]: 'What the hello'

That would lead you to ask: if m.group() returns '#Whatthehello', which call do I need to get 'Whatthehello'? A dive into the docs then solves the problem.

unutbu
  • Thanks! This is the best explained answer I have read so far. – Abhishek Bhatia Feb 10 '16 at 18:32
  • Using the interpreter to explain the problem step by step was brilliant. Thanks. Once you understand the problem, the solution just jumps at you. Moreover, you can carry what you understood to your future coding endeavors. – pembeci Oct 02 '18 at 13:02