
I have this code:

Long_string = """
“Fifty Shades of Grey” shakeup: Kelly Marcel not returning for Sequel
"""

I need to break down the string into words. I do:

text_to_list = Long_string.split()

The output is:

['\xa1\xb0Fifty', 'Shades', 'of', 'Grey\xa1\xb1', 'shakeup:', 'Kelly', 'Marcel', 'not', 'returning', 'for', 'Sequel']

However, some of those words have a special meaning when they appear together, like the quoted “Fifty Shades of Grey”, or a person's name made of consecutive capitalized words, like “Kelly Marcel”.

So I want to turn them into “Fifty-Shades-of-Grey” and “Kelly-Marcel” when they are split. How can I do that?


Sorry for the confusion. The need is to:

replace the space with "-" when it is:

  1. between words inside quotes
  2. between two capitalized words
halfer
Mark K
  • To do this, you need a very big English dictionary which contains all the movies and directors and actors, etc. – ljk321 Mar 27 '15 at 06:25
  • How does your program know that some words have a special meaning? If you have a list of special words and phrases, first replace each one with the same words joined by "-" instead of spaces, and then split the new text. – Alexander R. Mar 27 '15 at 06:25
  • Python does not know which words belong together and have "special meanings". You need to either mark them in the string somehow (e.g. put them in parentheses), or make some sort of special-words dictionary. – Marcin Mar 27 '15 at 06:26
  • skyline75489, Alexander Ravikovich and Marcin, thanks for your attention. Sorry that I didn't make the question clear enough. Please see the revision. – Mark K Mar 27 '15 at 06:34
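
The phrase-list idea from the comments can be sketched roughly as follows. This is only an illustration: `KNOWN_PHRASES` and `split_with_known_phrases` are made-up names, and the list would have to come from your own data.

```python
# Hypothetical phrase list; in practice this would come from your own data.
KNOWN_PHRASES = ["Fifty Shades of Grey", "Kelly Marcel"]

def split_with_known_phrases(text, phrases=KNOWN_PHRASES):
    # Handle longer phrases first so a shorter phrase cannot break up a longer one.
    for phrase in sorted(phrases, key=len, reverse=True):
        text = text.replace(phrase, phrase.replace(" ", "-"))
    return text.split()

headline = '"Fifty Shades of Grey" shakeup: Kelly Marcel not returning for Sequel'
print(split_with_known_phrases(headline))
```

Note that the quote characters stay attached to the first token here; they would need to be stripped separately if they are not wanted.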

4 Answers

2

I would do this in three parts. First, using a tweaked version of the regex in this answer, replace the spaces between two capitalised words with a -:

>>> import re
>>> long_string = '"Fifty Shades of Grey" shakeup: Kelly Marcel not returning for Sequel'
>>> long_string = re.sub(r'([A-Z][a-z]+(?=\s[A-Z]))(?:\s([A-Z][a-z]+))+', r'\1-\2', long_string)
>>> long_string
'"Fifty-Shades of Grey" shakeup: Kelly-Marcel not returning for Sequel'
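
One caveat with the pattern above: a repeated capture group keeps only its last match, so a run of three or more capitalised words (say "New York City") would come out as "New-City". A possible variant, sketched here under the assumption that every space between two capitalised words should become a hyphen, replaces one space at a time and uses a lookahead so the second word is not consumed. The name `hyphenate_capitalised` is just for illustration.

```python
import re

def hyphenate_capitalised(text):
    # Replace only the whitespace that sits between two capitalised words.
    # The lookahead leaves the second word unconsumed, so it can start the next match.
    return re.sub(r'\b([A-Z][a-z]+)\s(?=[A-Z][a-z])', r'\1-', text)

print(hyphenate_capitalised('Kelly Marcel visits New York City'))
```

On the headline from the question this should give the same result as the original pattern, since that input has no run longer than two words.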

Then, use the shlex library to split the string while keeping each quoted phrase together as a single token (the quote characters themselves are dropped):

>>> import shlex
>>> words = shlex.split(long_string)
>>> words
['Fifty-Shades of Grey',
 'shakeup:',
 'Kelly-Marcel',
 'not',
 'returning',
 'for',
 'Sequel']

Then use a list comprehension to replace all remaining spaces inside each token with a -:

>>> final = [x.replace(' ', '-') for x in words]
>>> final
['Fifty-Shades-of-Grey',
 'shakeup:',
 'Kelly-Marcel',
 'not',
 'returning',
 'for',
 'Sequel']
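
For convenience, the three steps can be wrapped in one function. The sketch below assumes Python 3 and adds one step that is not in the steps above: the string in the question uses typographic quotes (“ ”), which shlex does not treat as quotes, so they are converted to plain `"` first. The function name `split_headline` is made up for this example.

```python
import re
import shlex

def split_headline(text):
    # Assumption: typographic quotes should behave like plain double quotes.
    text = text.replace('\u201c', '"').replace('\u201d', '"')
    # Step 1: hyphenate capitalised word pairs (same regex as above).
    text = re.sub(r'([A-Z][a-z]+(?=\s[A-Z]))(?:\s([A-Z][a-z]+))+', r'\1-\2', text)
    # Step 2: split on whitespace, keeping each quoted phrase as one token.
    tokens = shlex.split(text)
    # Step 3: hyphenate whatever spaces are left inside a token.
    return [token.replace(' ', '-') for token in tokens]

print(split_headline('\u201cFifty Shades of Grey\u201d shakeup: Kelly Marcel not returning for Sequel'))
```

Keep in mind that shlex follows shell quoting rules, so a stray apostrophe (as in "people's") is read as an opening quote and `shlex.split` raises a `ValueError` when it is never closed.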
Ben
1

A regular expression alone is enough to find the quoted phrases and the runs of capitalized words; you can then replace the spaces inside them with "-".
Here's an example:

import re
Long_string = """
"Fifty Shades of Grey" shakeup: Kelly Marcel not returning for Sequel
"""
def check_string(text):
    # Group 1 captures a quoted phrase; group 2 captures a run of capitalized words.
    matches = re.findall(r'"(.+?)"|([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)', text)
    for groups in matches:
        for val in groups:
            # Only one group is non-empty per match; skip the empty one.
            if val:
                yield val.replace(" ", "-")

for j in check_string(Long_string):
    print(j)

The code above might not be efficient. It is only meant to show that a regular expression can serve as the search pattern; you can read up on regexps and improve it from there.
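
The example above yields only the special phrases. If the full word list is wanted, one possible extension (a sketch, not a tuned solution) is to add a catch-all `\S+` alternative so that every other word is returned as well. The names `TOKEN` and `tokenize` are made up here.

```python
import re

# Alternatives are tried left to right: a quoted phrase, then a run of
# capitalized words, then any other single word.
TOKEN = re.compile(r'"[^"]+"|[A-Z][a-z]+(?:\s[A-Z][a-z]+)+|\S+')

def tokenize(text):
    # Drop the surrounding quotes, then hyphenate whitespace inside each token.
    return [re.sub(r'\s+', '-', token.strip('"')) for token in TOKEN.findall(text)]

sample = """
"Fifty Shades of Grey" shakeup: Kelly Marcel not returning for Sequel
"""
print(tokenize(sample))
```

The order of the alternatives matters: putting `\S+` first would match every word on its own and the phrases would never be grouped.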

CY5
1

A brute-force, novice, non-regex approach that accomplishes the requirements:

Long_string = """
"Fifty Shades of Grey" shakeup: Kelly Marcel not returning for Sequel
"""
text_to_list = Long_string.split()

n_list = []
caps_started = 0
quotes_started = 0  # initialise so input that does not start with a quote also works
tmp_word = ''
for word in text_to_list:
    w_p=0
    if word[0] == '"':
        quotes_started = 1
        tmp_word += word
        w_p=1
        continue
    if quotes_started == 1:
        tmp_word += "-"+word
        w_p=1
    if word[-1:] == '"':
        quotes_started = 0
        n_list.append(tmp_word)
        tmp_word = ''
        w_p=1
        continue

    if quotes_started == 0:
        if word[0].isupper() and caps_started == 0:
            caps_started = 1
            tmp_word += word 
            w_p=1
            continue
        if caps_started == 1:
            tmp_word += "-"+word
            w_p=1
        if word[0].isupper() and caps_started == 1:
            caps_started = 0
            n_list.append(tmp_word)
            tmp_word = ''
            w_p=1
            continue
    if w_p == 0:
        n_list.append(word)


# Flush a phrase that was still being built when the input ended.
if tmp_word:
    n_list.append(tmp_word)

print(n_list)
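
The loop above relies on each capitalized word being followed by another one; a lone capitalized word in the middle of the sentence keeps absorbing the words after it. A more compact sketch of the same non-regex idea, which tracks only whether we are inside quotes and whether the previous word was capitalized, might look like this (`join_phrases` is a made-up name, and punctuation such as a trailing comma is deliberately not handled):

```python
def join_phrases(text):
    result = []
    in_quote = False
    prev_cap = False
    for word in text.split():
        if in_quote:
            # Inside a quoted phrase: glue every word onto the open token.
            result[-1] += "-" + word
            in_quote = not word.endswith('"')
            prev_cap = False
        elif word.startswith('"'):
            result.append(word)
            # Stay in quote mode unless the same word also closes the quote.
            in_quote = not (len(word) > 1 and word.endswith('"'))
            prev_cap = False
        else:
            starts_cap = word[0].isupper()
            if starts_cap and prev_cap:
                result[-1] += "-" + word
            else:
                result.append(word)
            prev_cap = starts_cap
    return result

print(join_phrases('"Fifty Shades of Grey" shakeup: Kelly Marcel not returning for Sequel'))
```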
Anshu Prateek
1

This might help (no regular expressions needed):

Long_string = """"Fifty Shades of Grey" shakeup: Kelly Marcel not returning for Sequel"""

previous_word_uppercase = 0
count = 0
buffer = ""
final_buffer = ""

text_to_list_prev = Long_string.split('"')

for i in text_to_list_prev:
    j = i
    if count%2 != 0:
        j = '"' + i.replace(" ", "-") +'"'
    buffer = buffer + j
    count += 1

text_to_list = buffer.split(" ")

previous_word_uppercase = 0
count = 0

for i in text_to_list:
    j = i
    if i[0].isupper():
        if previous_word_uppercase == 1:
            j = "-" + i
            final_buffer = final_buffer +j
        else:
            final_buffer = final_buffer +" "+j
        previous_word_uppercase = 1
    else:
        previous_word_uppercase = 0
        final_buffer = final_buffer +" "+j
    count = count +1

# The loop prefixes the first word with a space, so strip it before printing.
print(final_buffer.strip())

Output

"Fifty-Shades-of-Grey" shakeup: Kelly-Marcel not returning for Sequel