My problem is the following: I want to do a sentiment analysis on Italian tweets, and I would like to tokenise and lemmatise the Italian text in order to find new analysis dimensions for my thesis. The problem is that I would like to tokenise my hashtags, also splitting the composed ones. For example, if I have #nogreenpass, I would like to have it also without the # symbol, because the sentiment of the phrase would be better understood with all the words of the text. How could I do this? I tried with spaCy, but I got no results. I created a function to clean my text, but I can't get the hashtags the way I want. I'm using this code:

import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('it_core_news_lg')

# Clean_text function
def clean_text(text):
    text = str(text).lower()
    doc = nlp(text)
    text = re.sub(r'#[a-z0-9]+', str(' '.join(t in nlp(doc))), str(text)) # broken: 't' is undefined here and hashtags are not split
    text = re.sub(r'\n', ' ', str(text)) # Remove \n
    text = re.sub(r'@[A-Za-z0-9]+', '<user>', str(text)) # Remove and replace @mention
    text = re.sub(r'RT[\s]+', '', str(text)) # Remove RT
    text = re.sub(r'https?:\/\/\S+', '<url>', str(text)) # Remove and replace links
    return text

For example, here I don't know how to add the first < and the last > in place of the # symbol, and the tokenisation process doesn't work as I would like. Thank you for the time spent on me and for the patience. I hope to become stronger in Jupyter analysis and Python coding so that I can help with your problems too. Thank you guys!

Jhonny
    What you have here is not related to spacy, but to regex. Could you please provide a sample string and expected output? – Wiktor Stribiżew Dec 10 '21 at 13:02
  • Please check https://ideone.com/pxZqeK - does it work as expected? – Wiktor Stribiżew Dec 10 '21 at 13:12
  • @WiktorStribiżew thank you for the answer. It doesn't work as I would like. For example, with this string: "@Marcorossi hanno ragione I #novax http://www.asfag.com", I would like this output: "<user> hanno ragione I <no vax> <url>". I thought of spaCy because I want composed hashtags to be separated and inserted between two brackets like this: <>. Thank you for your time – Jhonny Dec 10 '21 at 14:01
  • Then it is a matter of adding `<` and `>` into the replacement: `re.sub(r'#(\w+)', r'<\1>', text)`. See https://ideone.com/uG0YCW – Wiktor Stribiżew Dec 10 '21 at 14:08
  • @WiktorStribiżew only one thing I couldn't do: separate the word novax. In this way I have <novax>, but I would like <no vax>; that is why I thought of spaCy. – Jhonny Dec 10 '21 at 14:11
  • Spacy does not fix any typos for you if that is what you mean. – Wiktor Stribiżew Dec 10 '21 at 14:12
  • I thought to tokenise the composed words in order to join them with a ' '.join method based on the spaCy tokenisation. If that doesn't work, how could I split a composed hashtag into its elementary words? To be clearer, for example, I would like this hashtag #noallascuoladellaviolenza as <no alla scuola della violenza>. Thank you again – Jhonny Dec 10 '21 at 14:22
  • There is no way to split a glued word into its constituent words in a simple way. See [How to split text without spaces into list of words](https://stackoverflow.com/q/8870261/3832970). – Wiktor Stribiżew Dec 10 '21 at 14:25

1 Answer

You can tweak your current clean_text function with:

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'#(\w+)', r'<\1>', text)
    text = re.sub(r'\n', ' ', text) # Remove \n
    text = re.sub(r'@[A-Za-z0-9]+', '<user>', text) # Remove and replace @mention
    text = re.sub(r'RT\s+', '', text) # Remove RT
    text = re.sub(r'https?://\S+\b/?', '<url>', text) # Remove and replace links
    return text

See the Python demo online.

The following line of code:

print(clean_text("@Marcorossi hanno ragione I #novax http://www.asfag.com/"))

will yield

<user> hanno ragione i <novax> <url>
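
Since the broader goal in the question is to tokenise and lemmatise the cleaned text, here is a minimal sketch of the next step, assuming the it_core_news_lg model from the question is installed. The add_special_case calls are optional; they only stop the default tokenizer from splitting the <user>/<url> placeholders into <, user, >:

import spacy
from spacy.attrs import ORTH

nlp = spacy.load('it_core_news_lg')

# Optional: keep the placeholders whole instead of letting the default
# tokenizer split them into '<', 'user', '>'
for placeholder in ('<user>', '<url>'):
    nlp.tokenizer.add_special_case(placeholder, [{ORTH: placeholder}])

doc = nlp(clean_text("@Marcorossi hanno ragione I #novax http://www.asfag.com/"))
print([(t.text, t.lemma_) for t in doc])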

Note there is no easy way to split a glued string into its constituent words. See [How to split text without spaces into list of words](https://stackoverflow.com/q/8870261/3832970) for ideas on how to do that.
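
For completeness, here is a minimal sketch of the dictionary-based dynamic programming idea from that linked question. The WORDS set below is a hypothetical stand-in; in practice you would load a real Italian word list from a file:

import re

# Hypothetical mini-vocabulary; replace with a full Italian word list
WORDS = {'no', 'green', 'pass', 'vax', 'alla', 'scuola', 'della', 'violenza'}

def split_hashtag(tag):
    # best[i] holds a list of words covering tag[:i], or None if unreachable
    best = [None] * (len(tag) + 1)
    best[0] = []
    for i in range(1, len(tag) + 1):
        for j in range(i):
            if best[j] is not None and tag[j:i] in WORDS:
                best[i] = best[j] + [tag[j:i]]
                break
    return best[len(tag)] or [tag]  # fall back to the unsplit tag

print(split_hashtag('noallascuoladellaviolenza'))
# ['no', 'alla', 'scuola', 'della', 'violenza']

Plugged into the hashtag rule, this yields the <no vax>-style output asked for in the comments:

text = re.sub(r'#(\w+)', lambda m: '<' + ' '.join(split_hashtag(m.group(1))) + '>',
              'hanno ragione i #novax')
print(text)  # hanno ragione i <no vax>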

Wiktor Stribiżew