
I have a csv file with 60000+ tweets. I have cleaned the file to a certain extent, but it still has words (mixed characters, probably left over after URL cleaning) that do not make any sense. I am not allowed to post any images, so I am posting a portion of the file.

Fintech Bitcoin crowdfunding and cybersecurity fintech bitcoin crowdfunding and cybersecurity
monster has left earned total satoshi monstercoingame Bitcoin
Bitcoin TCH bitcoin btch
bitcoin iticoin SPPL BXsAJ
coindesk The latest Bitcoin Price Index USD pic twitter com aKk
Trends For Bitcoin Regulation ZKdFZS via CoinDeskpic twitter com KNKgFcdxYD
Now there Mike Tyson Bitcoin app theres mike tyson bitcoin app
BitcoinBet Positive and negative proofs blockchain audits Bitcoin Bitcoin via
The latest Bitcoin Price Index USD pic twitter com CivXlPj
Bitcoin price index pic twitter com xhQQ mbRIb

As you can see, some tokens (for example, aKk, KNKgFcdxYD, xhQQ) don't make any sense, so I want to remove them. They are stored in a column named [clean_tweet].

I have sort of stitched together the following code for the whole cleaning process (from raw tweets to the current version that I posted), but I don't know how I could remove those "characters". My code is as follows. Any suggestions would be appreciated. Thank you.

import re
import pandas as pd 
import numpy as np 
import string
import nltk
from nltk.stem.porter import *
import warnings 
from datetime import datetime as dt

warnings.filterwarnings("ignore", category=DeprecationWarning)

tweets = pd.read_csv(r'myfilepath.csv')
df = pd.DataFrame(tweets, columns = ['date','text'])

df['date'] = pd.to_datetime(df['date']).dt.date #changing date to datetime format from time-series

#removing pattern from tweets

def remove_pattern(input_txt, pattern):
    # re.escape() so each matched string is removed literally,
    # not re-interpreted as a regex
    for match in re.findall(pattern, input_txt):
        input_txt = re.sub(re.escape(match), '', input_txt)
    return input_txt

# remove twitter handles (@user)
tweets['clean_tweet'] = np.vectorize(remove_pattern)(tweets['text'], r"@[\w]*")
# remove urls (from the already handle-free column, not the raw text again)
tweets['clean_tweet'] = np.vectorize(remove_pattern)(tweets['clean_tweet'], r"https?://[A-Za-z./]*")

# remove special characters, numbers, punctuation
tweets['clean_tweet'] = tweets['clean_tweet'].str.replace(r"[^a-zA-Z#]", " ", regex=True)
tweets['clean_tweet'] = tweets['clean_tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))  
non_linear
    [Please, don't post images of text.](https://unix.meta.stackexchange.com/questions/4086/psa-please-dont-post-images-of-text) – accdias Jan 18 '20 at 15:11
  • are these the only three words or just examples? – Ahmed Sunny Jan 18 '20 at 15:23
  • Welcome to SO! Please take a moment to read about how to post pandas questions: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – YOLO Jan 18 '20 at 15:25
  • `(r'http.?://[^\s]+[\s]?',)` and for https use `(r'https.?://[^\s]+[\s]?')` use these patterns – Ahmed Sunny Jan 18 '20 at 15:27
  • 2
    @AhmedSunny, your first pattern already matches `https` as well, so there is no need for the second one. In fact the second doesn't make any sense, because it will try to match `https` followed by any other character BEFORE the `:`. For example `httpsa://`. – accdias Jan 18 '20 at 15:29
  • Assuming you are processing Tweets in English, does this answer your question? [How to check if a word is an English word with Python?](https://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python) – accdias Jan 18 '20 at 15:37
  • @AhmedSunny these are just examples. There are probably thousands of them in the file. – non_linear Jan 18 '20 at 15:37
  • @accdias I am processing tweets in English. The link that you provided could be a first step: I need to check whether these characters (words) exist, and if they don't (as should be expected), remove them. But I don't think it is going to work, as there are thousands of these words in the file; would I have to check manually? – non_linear Jan 18 '20 at 16:02
  • @Rasel, I posted an answer with a starting point for what you will need. I'm quite sure there is a way to "teach" the spell checker new words like the ones it marked as invalid in English. Anyway, I still think a spell checker is your best bet. – accdias Jan 18 '20 at 16:04

3 Answers


It might be easier to qualify the characters you want instead of the universe of unwanted ones. Negative matching with a regex?

    if re.match(r"[A-Za-z0-9@#$%^&*()!+='\";:?-]", char) is None:
        text = text.replace(char, '')

Clean a regex like that up a little for what you're looking for and just loop through the characters of each string. And then, thank God for computers doing all the tedious work for you!
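Applied to the asker's DataFrame, the same whitelist idea can be done in one vectorized pass with pandas instead of a per-character Python loop. A minimal sketch; the sample data below is made up for illustration and the column name `clean_tweet` is taken from the question:

```python
import pandas as pd

# Made-up sample standing in for the asker's DataFrame.
df = pd.DataFrame({'clean_tweet': ['Bitcoin price index… ok', 'Trends, For Bitcoin!']})

# Delete every character that is NOT in the allowed set (note the leading ^),
# in a single vectorized pass over the whole column.
allowed = r"[^A-Za-z0-9@#$%^&*()!+='\";:? -]"
df['clean_tweet'] = df['clean_tweet'].str.replace(allowed, '', regex=True)

print(df['clean_tweet'].tolist())  # ['Bitcoin price index ok', 'Trends For Bitcoin!']
```

Note that this only strips unwanted characters; alphanumeric gibberish like aKk survives, which is why the dictionary-based answers may still be needed on top.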

Kyle Hurst

Following up on my comments, I guess your task will become easier if you use a spell-checker library to see whether the words are valid English or not.

Something like this (using enchant, for example):

import enchant
from pprint import pprint

en_us = enchant.Dict("en_US")
text = '''
Fintech Bitcoin crowdfunding and cybersecurity fintech bitcoin crowdfunding and cybersecurity
monster has left earned total satoshi monstercoingame Bitcoin
Bitcoin TCH bitcoin btch
bitcoin iticoin SPPL BXsAJ
coindesk The latest Bitcoin Price Index USD pic twitter com aKk
Trends For Bitcoin Regulation ZKdFZS via CoinDeskpic twitter com KNKgFcdxYD
Now there Mike Tyson Bitcoin app theres mike tyson bitcoin app
BitcoinBet Positive and negative proofs blockchain audits Bitcoin Bitcoin via
The latest Bitcoin Price Index USD pic twitter com CivXlPj
Bitcoin price index pic twitter com xhQQ mbRIb
'''
phrases = text.split('\n')
print('BEFORE')
pprint(phrases)

for i, phrase in enumerate(phrases):
    phrases[i] = ' '.join(w for w in phrase.split() if en_us.check(w))

print('AFTER')
pprint(phrases)

The code above will result in something like:

BEFORE
['',
 'Fintech Bitcoin crowdfunding and cybersecurity fintech bitcoin crowdfunding '
 'and cybersecurity',
 'monster has left earned total satoshi monstercoingame Bitcoin',
 'Bitcoin TCH bitcoin btch',
 'bitcoin iticoin SPPL BXsAJ',
 'coindesk The latest Bitcoin Price Index USD pic twitter com aKk',
 'Trends For Bitcoin Regulation ZKdFZS via CoinDeskpic twitter com KNKgFcdxYD',
 'Now there Mike Tyson Bitcoin app theres mike tyson bitcoin app',
 'BitcoinBet Positive and negative proofs blockchain audits Bitcoin Bitcoin '
 'via',
 'The latest Bitcoin Price Index USD pic twitter com CivXlPj',
 'Bitcoin price index pic twitter com xhQQ mbRIb',
 '']
AFTER
['',
 'Bitcoin and bitcoin and',
 'monster has left earned total Bitcoin',
 'Bitcoin bitcoin',
 'bitcoin',
 'The latest Bitcoin Price Index pic twitter com',
 'Trends For Bitcoin Regulation via twitter com',
 'Now there Mike Tyson Bitcoin app mike bitcoin app',
 'Positive and negative proofs audits Bitcoin Bitcoin via',
 'The latest Bitcoin Price Index pic twitter com',
 'Bitcoin price index pic twitter com',
 '']

BUT, as you can see, words like Fintech, crowdfunding, and cybersecurity (to list a few) were marked as NOT valid in English, so you will need to fine-tune the code for your needs.

I hope it helps.

Update: to add word exceptions to your spell checker, do something like this:

exceptions = [
    'Fintech',
    'fintech',
    'crowdfunding',
    'cybersecurity',
    'satoshi',
    'monstercoingame',
    'TCH',
    'coindesk',
    'USD',
    'CoinDeskpic',
    'theres',
    'tyson',
    'BitcoinBet',
    'blockchain'
]

for word in exceptions:
    # add word to personal dictionary
    #en_us.add(word)
    # or add word just for this session only
    en_us.add_to_session(word)
accdias
  • Sorry for the late reply. I was trying other methods as well. The spell-checking method will work on a small file, but apart from the 60000+ tweets file, I have more than 10 million tweets in my primary csv file. I doubt it would be possible to add exceptions for millions of tweets. I will keep the question open in case anyone knows how to scale it. Thanks for your answer though. – non_linear Jan 21 '20 at 21:07
  • OK. No worries. But I guess you can also find a list of English neologisms somewhere on the Internet and use that as an appendix to the dictionary. – accdias Jan 21 '20 at 23:06
  • Something like [12dicts](http://wordlist.aspell.net/12dicts/) perhaps. – accdias Jan 21 '20 at 23:11

There is a way to do it using nltk; it will also remove URLs.

URLs need to be removed first, otherwise it will remove some words from the URLs and make things worse.

import re
import nltk

nltk.download('words')  # if it's needed
words = set(nltk.corpus.words.words())

def clean_tweets(text):
    text = re.sub(r'https?://[^\s]+[\s]?', '', text)
    return " ".join(w for w in nltk.wordpunct_tokenize(text)
                    if w.lower() in words or not w.isalpha())

This will remove the nonsense words. Example:

test = 'this is a  test KNKgFcdxYD to check https://stackoverflow.com/questions/295 xhQQ'
ret = clean_tweets(test)
print(ret)
# output
#this is a test to check
Ahmed Sunny
  • the reason for a down vote would be much appreciated, because I tested it before posting here and it worked fine in this regard – Ahmed Sunny Jan 19 '20 at 12:08
  • I don't know about the down vote, but your code will remove all the hashtagged words as well, together with the special characters. Just tested it on a different tweet. – non_linear Jan 21 '20 at 21:21