I have a CSV file with 60,000+ tweets. I have cleaned the file to a certain extent, but it still contains tokens that do not make any sense (probably mixed characters left over after URL cleaning). I am not allowed to post images, so I am posting a portion of the file instead:
Fintech Bitcoin crowdfunding and cybersecurity fintech bitcoin crowdfunding and cybersecurity
monster has left earned total satoshi monstercoingame Bitcoin
Bitcoin TCH bitcoin btch
bitcoin iticoin SPPL BXsAJ
coindesk The latest Bitcoin Price Index USD pic twitter com aKk
Trends For Bitcoin Regulation ZKdFZS via CoinDeskpic twitter com KNKgFcdxYD
Now there Mike Tyson Bitcoin app theres mike tyson bitcoin app
BitcoinBet Positive and negative proofs blockchain audits Bitcoin Bitcoin via
The latest Bitcoin Price Index USD pic twitter com CivXlPj
Bitcoin price index pic twitter com xhQQ mbRIb
As you can see, some tokens (for example, aKk, KNKgFcdxYD, xhQQ) don't make any sense, so I want to remove them. They are stored in a column named [clean_tweet].
I have stitched together the following code for the whole cleaning process (from the raw tweets to the version posted above), but I don't know how to remove those "characters". My code is below. Any suggestions would be appreciated. Thank you.
import re
import pandas as pd
import numpy as np
import string
import nltk
from nltk.stem.porter import *
import warnings
from datetime import datetime as dt
warnings.filterwarnings("ignore", category=DeprecationWarning)
tweets = pd.read_csv(r'myfilepath.csv')
df = pd.DataFrame(tweets, columns=['date', 'text'])
df['date'] = pd.to_datetime(df['date']).dt.date  # keep only the calendar date of each timestamp

# remove every substring of input_txt that matches pattern
def remove_pattern(input_txt, pattern):
    r = re.findall(pattern, input_txt)
    for i in r:
        # escape the match so characters such as '.' or '?' are treated literally
        input_txt = re.sub(re.escape(i), '', input_txt)
    return input_txt

# remove twitter handles (@user)
tweets['clean_tweet'] = np.vectorize(remove_pattern)(tweets['text'], r"@[\w]*")
# remove urls (applied to clean_tweet so the handle removal above is not overwritten)
tweets['clean_tweet'] = np.vectorize(remove_pattern)(tweets['clean_tweet'], r"https?://[A-Za-z./]*")
# remove special characters, numbers, punctuation (regex=True is required in recent pandas)
tweets['clean_tweet'] = tweets['clean_tweet'].str.replace("[^a-zA-Z#]", " ", regex=True)
# remove short words (two characters or fewer)
tweets['clean_tweet'] = tweets['clean_tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))
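
One possible direction, offered as a minimal sketch rather than a tested solution. The URL pattern above stops at the first character outside [A-Za-z./] (a digit, for instance), and it never matches pic.twitter.com links because they carry no http prefix, so the tail of each link survives as a stray token. The sketch below removes whole URLs up to the next whitespace and then, as a rough heuristic, drops tokens whose casing is neither lower, UPPER nor Capitalized. The names URL_PATTERN, strip_urls, is_irregular_case and drop_irregular_tokens are hypothetical, and the casing rule is an assumption that will also drop legitimate CamelCase words such as CoinDeskpic or BitcoinBet.

```python
import re

# Assumption: a URL runs until the next whitespace character.
# The second alternative covers pic.twitter.com links without a scheme.
URL_PATTERN = re.compile(r"(?:https?://|pic\.twitter\.com/)\S+")


def strip_urls(text):
    """Replace whole URLs, including digits and mixed-case paths, with a space."""
    return URL_PATTERN.sub(" ", text)


def is_irregular_case(word):
    """True for tokens such as 'aKk' that are not lower, UPPER or Capitalized."""
    return not (word.islower() or word.isupper() or word.istitle())


def drop_irregular_tokens(text):
    """Keep only tokens with regular casing (heuristic, drops CamelCase too)."""
    return " ".join(w for w in text.split() if not is_irregular_case(w))


if __name__ == "__main__":
    sample = "coindesk The latest Bitcoin Price Index USD pic twitter com aKk"
    print(drop_irregular_tokens(sample))
```

Running strip_urls on the raw text before the punctuation step should prevent most of the debris from appearing in the first place. For data that is already cleaned, something like tweets['clean_tweet'].apply(drop_irregular_tokens) would apply the casing filter to the existing column. All-uppercase leftovers such as TCH or SPPL pass this filter, so a dictionary lookup would be needed to catch those.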