I am working on a school project, and with online instruction it is much harder to get help. I have a dataset in Excel that contains links and emojis I need to remove.

This is what my data looks like now. I want to get rid of the https://t.co/....... links, the emojis, and some of the weird characters.

screenshot of twitter data

Does anyone have suggestions on how to do this in Excel, or maybe in Python?

D45

2 Answers

According to this reference, I believe you could write a function like this:

def checkChars(inputString):
    # Characters to keep in addition to ASCII letters and digits
    allowedChars = {" ", "/", ":", ".", ",", ";"}
    # isalnum() alone also accepts non-ASCII letters, so require isascii() too (Python 3.7+)
    return "".join(
        ch for ch in inputString
        if (ch.isascii() and ch.isalnum()) or ch in allowedChars
    )
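Note that a character filter like this keeps "/", ":" and ".", so the text of a t.co link would survive it. One way around that, as a rough sketch, is to remove URLs with a regular expression before filtering characters. The function name `clean_tweet`, the URL pattern, and the set of allowed punctuation below are assumptions chosen for illustration, not something from the question:

```python
import re

# Assumed pattern: anything starting with http:// or https:// up to the next whitespace
URL_PATTERN = re.compile(r"https?://\S+")
# Hypothetical whitelist of punctuation to keep; adjust to your data
ALLOWED_PUNCTUATION = set(" .,;:!?'\"#@-")

def clean_tweet(text):
    # Step 1: drop links so their characters are not kept by the filter below
    without_links = URL_PATTERN.sub("", text)
    # Step 2: keep ASCII letters/digits and the whitelisted punctuation only
    kept = [
        ch for ch in without_links
        if (ch.isascii() and ch.isalnum()) or ch in ALLOWED_PUNCTUATION
    ]
    # Step 3: collapse the runs of spaces left behind by removed items
    return " ".join("".join(kept).split())

print(clean_tweet("Great day at school 😀🎉 https://t.co/AbC123 #online"))
```

Removing the links first matters here: if the character filter ran first, the link would be reduced to plain text that is harder to recognize afterwards.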
EnriqueBet
I'm not sure how to do it in Excel, but you can load the Excel file into a pandas DataFrame and then use a regex to strip the non-ASCII characters:

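import pandas as pd

file_path = '/some/path/to/file.xlsx'
df = pd.read_excel(file_path, index_col=0)
# r'\W+' would also strip spaces and punctuation; match links and non-ASCII runs instead
df = df.replace(r'https?://\S+|[^\x00-\x7F]+', '', regex=True)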

Here you can find an extra explanation about loading an Excel file into a DataFrame.

Here you can read about more ways to remove non-ASCII characters from a DataFrame.
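If only one column holds the tweet text, the same idea can be applied to that column alone. The following is a minimal sketch using a small in-memory DataFrame instead of the Excel file; the column name `text`, the helper `clean_column`, and the sample rows are made up for illustration. Be aware that stripping every non-ASCII character also removes accented letters, which may or may not be what you want:

```python
import pandas as pd

# Stand-in for the data loaded with pd.read_excel; "text" is a hypothetical column name
df = pd.DataFrame({
    "text": [
        "Great day 😀 https://t.co/AbC123",
        "Caf\u00e9 time \u2615 #break",
    ]
})

def clean_column(series):
    cleaned = series.astype(str)
    # Remove links first (assumed pattern: http/https up to the next whitespace)
    cleaned = cleaned.str.replace(r"https?://\S+", "", regex=True)
    # Remove every run of non-ASCII characters (emojis, accented letters, etc.)
    cleaned = cleaned.str.replace(r"[^\x00-\x7F]+", "", regex=True)
    # Collapse leftover whitespace and trim the ends
    return cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

df["clean_text"] = clean_column(df["text"])
print(df["clean_text"].tolist())
```

The cleaned frame could then be written back with `df.to_excel(...)` if the result is needed in Excel again.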