
I'm looking at a Twitter dataset and ran into a problem when trying to remove mentions from the tweets that contain them. I tried the following:

echo ' "@user lol I needed it! went to sleep around 3am and woke up around 5 am! lol horrible! "' | \
sed 's/@.*[[:blank:]]//g'

My expected output is "lol I needed it! went to sleep around 3am and woke up around 5 am! lol horrible! ", but I'm getting just two quotation marks (""). I find this strange because the following dummy example works (it outputs "zzz" "dfg"):

echo '"zzz" "@abc dfg"' | sed 's/@.*[[:blank:]]//g'

I'm using GNU sed, and the dataset I'm looking at can be downloaded here: http://help.sentiment140.com/for-students/. Any idea why this might be failing?

MikeKatz45
  • `.*[[:blank:]]` matches *the longest possible string* with a space after it. Nothing in your regex stops the string itself from containing spaces, so everything from the `@` to the last space in the line gets matched. – Charles Duffy Nov 25 '19 at 00:39
  • ...so, in your case, you probably want to match not `@.*` but `@[^[:blank:]]*`, such that your string can contain *only non-blank* characters. – Charles Duffy Nov 25 '19 at 00:40
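
The two comments above can be checked side by side with a minimal sketch. The variable names (`tweet`, `greedy`, `cleaned`) are only illustrative, and the trailing `[[:blank:]]*` is an assumption on my part (it also removes a mention at the very end of a line); the comment itself only suggests `@[^[:blank:]]*`.

```shell
# Sample tweet from the question (the leading space is part of the input).
tweet=' "@user lol I needed it! went to sleep around 3am and woke up around 5 am! lol horrible! "'

# Original pattern: '.*' may contain blanks, so the match runs from '@'
# to the LAST blank on the line, leaving only the quotation marks.
greedy=$(printf '%s\n' "$tweet" | sed 's/@.*[[:blank:]]//g')

# Suggested pattern: '[^[:blank:]]*' cannot cross a blank, so the match
# stops at the end of the mention; '[[:blank:]]*' eats the blanks after it.
cleaned=$(printf '%s\n' "$tweet" | sed 's/@[^[:blank:]]*[[:blank:]]*//g')

printf 'greedy:  %s\n' "$greedy"
printf 'cleaned: %s\n' "$cleaned"
```

The dummy example in the question only appears to work because the space after `@abc` happens to be the last space on that line. Note that this pattern removes anything starting with `@`, so text such as e-mail addresses would be truncated too.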
