I was doing some Twitter mining and pulled the JSON of tweets into Python 3 via pandas.

Before processing further, I noticed a lot of the data was not consistent, clean, or even useful to me (for now), so I used regex to make the tweet text consistent or to delete the offending items.

This is what I have:

data['full_text'] = data['full_text'].replace('^@ABC(\\u2019s)*[ ,\n\t]*', '', regex=True)
data['full_text'] = data['full_text'].replace('(\\n)', '', regex=True)
data['full_text'] = data['full_text'].replace('(\\t)', '.', regex=True)
data['full_text'] = data['full_text'].replace('(\\u2018)|(\\u2019)', "'", regex=True)
data['full_text'] = data['full_text'].replace('(\\u201c)|(\\u201d)', "\"", regex=True)
data['full_text'] = data['full_text'].replace('(\\n)|(\\t)', '', regex=True)

i.e.
- remove the Twitter handle if used at the beginning (including punctuation linked to it)
- JSON should have no issue with apostrophes, so keep everything consistent and replace the unicode left/right apostrophes with a single '
- some tweets use a backslash for quotes, others use unicode; keep consistent and replace the unicode with \"
- delete all tabs
- assume all new lines are new sentences, so replace them with a full stop
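
The rules listed above could be sketched on a plain string roughly as follows. This is only an illustration using the standard `re` module rather than pandas; `clean_tweet`, its `handle` parameter, and the sample text are hypothetical names, and it follows the written rules (tabs deleted, newlines turned into full stops).

```python
import re

def clean_tweet(text, handle="ABC"):
    # Drop a leading @handle (optionally with 's) and the punctuation after it.
    text = re.sub(r"^@" + re.escape(handle) + r"(?:\u2019s|'s)?[ ,\n\t]*", "", text)
    # Curly apostrophes and curly quotes become their plain ASCII forms.
    text = re.sub(r"[\u2018\u2019]", "'", text)
    text = re.sub(r"[\u201c\u201d]", '"', text)
    # Tabs are deleted; each run of newlines is treated as a sentence break.
    text = text.replace("\t", "")
    text = re.sub(r"\n+", ". ", text)
    return text

cleaned = clean_tweet("@ABC\u2019s, it\u2019s \u201cfine\u201d\tok\nnext")
print(cleaned)
```

With pandas, the same function could be applied per row, for example with `data['full_text'].map(clean_tweet)`.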

As far as I'm aware, this is all that is really needed. Things like ~ are likely to be useless, with no real purpose to them. The tweets will also have emoticons that I don't care about (for now).

The rest of the punctuation and these emoticons follow the format \uXXXX, where each X is a number or letter,

so my last line was planned to be the below:

data['full_text'] = data['full_text'].replace('(\\u\w\w\w\w)', "", regex=True)

Given the large number of tweets I have, I can't verify whether everything worked correctly, which is why I'm asking for advice.

From my research I kept seeing people post things like:

([\u2600-\u27BF])|([\uD83C][\uDF00-\uDFFF])|([\uD83D][\uDC00-\uDE4F])|([\uD83D][\uDE80-\uDEFF]) 

but when I try these, I still see emoticons etc. left in the JSON. So why not just use \u\w\w\w\w (especially when it is the last step)?
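
A possible explanation, sketched under the assumption that the tweets are already decoded into Python 3 strings: the pattern above describes emoji as UTF-16 surrogate pairs, whereas a Python 3 string holds each emoji as one code point. The ranges below are the same blocks translated to code points; `EMOJI`, `pairs`, and the sample text are illustrative names, and the ranges are not a complete emoji list.

```python
import re

# The surrogate-pair ranges from the pattern above, rewritten as code points:
#   \uD83C[\uDF00-\uDFFF] -> U+1F300..U+1F3FF
#   \uD83D[\uDC00-\uDE4F] -> U+1F400..U+1F64F
#   \uD83D[\uDE80-\uDEFF] -> U+1F680..U+1F6FF
EMOJI = re.compile('[\u2600-\u27BF\U0001F300-\U0001F64F\U0001F680-\U0001F6FF]')

sample = 'good morning \U0001F600 \u2600'

# The surrogate-pair form finds nothing in a Python 3 string.
pairs = re.compile(r'[\uD83D][\uDC00-\uDE4F]')
unchanged = pairs.sub('', sample)

# The code point form removes both symbols.
stripped = EMOJI.sub('', sample)
print(unchanged == sample, len(stripped))
```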

Update:

data['full_text'] = data['full_text'].replace('^@ABC(\\u2019s)*[ ,\n\t]*', '', regex=True)
data['full_text'] = data['full_text'].replace('(\\n)', '', regex=True)
data['full_text'] = data['full_text'].replace('(\\t)', '.', regex=True)
data['full_text'] = data['full_text'].replace('(\\u2018)|(\\u2019)', "'", regex=True) 
data['full_text'] = data['full_text'].replace('(\\u201c)|(\\u201d)', "\"", regex=True)
data['full_text'] = data['full_text'].replace(r'https://t\.co/\w{10}', '', regex=True)
import string
data['full_text'] = data['full_text'].replace('[^{}]'.format(string.printable), '', regex=True)

It works thanks to James, although I'm getting conflicting information. Is the last line appropriate? Is it deleting anything more than just unicode?

user3120554

1 Answer

It looks like you have a misunderstanding of Unicode. Unicode is a standard for describing characters/text/emoji/pictographs/etc. That's it. For example,

  • the Unicode standard for character 0041 (code point 65 in decimal, since Unicode code points are written in hexadecimal) is "the Latin capital letter A".
  • the Unicode standard for character 2600 is "black sun with rays".

So that's it. Unicode gives a description of what the character should be. It is up to the particular font and encoding to determine if the character is even displayed and what it looks like on screen. For my particular setup (Windows 10, Consolas font in the terminal), Consolas does not have a character that represents '\u2600', so it just displays the default 'missing' character of the confused tofu (a box with a question mark in the center).

So how does this relate to your question? The string '\u2600' is not six characters (a backslash, a u, and four hex digits) but a single character, represented by its Unicode hexadecimal code point. That is why a regex of \u\w\w\w\w will not work: it is looking for a run of separate characters, but each Unicode character is only a single character in the string.

You can test it yourself.

len('\u2600')
# returns
1
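
This may also explain why the earlier replacements worked while the blanket one did not. A minimal sketch, assuming pandas passes these patterns on to Python's `re` module: the regex engine itself understands a complete `\uXXXX` escape and turns it into the one real character, but `\u` followed by `\w` is not a complete escape, so the pattern is rejected. The variable names are illustrative.

```python
import re

text = "it\u2019s sunny \u2600"

# The Python literal '\\u2019' is the six-character regex \u2019, which the
# re module interprets as the single character U+2019.
fixed = re.sub('\\u2019', "'", text)

# \u must be followed by exactly four hex digits, so this pattern is
# rejected by the regex compiler instead of matching "any escape".
try:
    re.compile('\\u\\w\\w\\w\\w')
    compiled = True
except re.error as exc:
    compiled = False
    print("rejected:", exc)
```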

If you really want to remove all non-ascii characters, you can just filter out the text you don't want.

import string

df['full_text'] = df['full_text'].replace('[^{}]'.format(string.printable), '', regex=True)
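
For a sense of what that pattern keeps and drops, here is a small sketch on a plain string with `re` (the sample text and variable names are made up for illustration). The negated class removes every character that is not in `string.printable`, so accented letters go as well as emoji. One side effect worth checking on your own data: because the backslash in `string.printable` ends up escaping the `]` that follows it inside the class, a literal backslash is not in the allowed set and is removed too. Passing the characters through `re.escape` first, or using a plain ASCII range, avoids that.

```python
import re
import string

sample = 'caf\u00e9 \u2600 C:\\temp \u201cquoted\u201d'

# Pattern as above: non-ASCII characters and the literal backslash are removed.
naive = re.sub('[^{}]'.format(string.printable), '', sample)

# Escaping the characters first keeps the backslash in the allowed set.
escaped = re.sub('[^{}]'.format(re.escape(string.printable)), '', sample)

# A related alternative: drop everything above U+007F (control characters stay).
ascii_only = re.sub(r'[^\x00-\x7F]', '', sample)

print(naive)
print(escaped)
print(ascii_only)
```
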
James
  • I understand that it can represent one character, i.e. \u2018 and \u2019 represent apostrophe characters, etc., and from what the JSON looks like, an emoticon is a pair of these, so \ud83d\ude00 = smiley. But they all still follow the format \uXXXX. That's why I try to replace each unicode representation of a character with something that doesn't need unicode (or delete it). I think the main part of my confusion is that most of my code works (unless I've overlooked something) except for that last line – user3120554 Oct 21 '17 at 04:38
  • Am I not following the same logic as people like this (the only difference is that their code is safer because they look for unicode within a list instead of a blanket delete like mine, but then their code does not handle all emoji): https://stackoverflow.com/questions/13729638/how-can-i-filter-emoji-characters-from-my-input-so-i-can-save-in-mysql-5-5/13752628#13752628 – user3120554 Oct 21 '17 at 04:42
  • So if I wanted a JSON without unicode (which helps with readability and is less likely to cause issues when being pulled into programs), surely it's fine to remove anything with \u + 4 extra characters before saving back into JSON? – user3120554 Oct 21 '17 at 04:45
  • Yes, the question is how you plan to go about doing that. You cannot search for `\u` followed by 4 characters, because that is not what the string actually contains. – James Oct 21 '17 at 04:48
  • I forgot to elaborate that I'm saving this back into JSON. I understand that once read in (to my particular program) the unicode will be replaced by the appropriate character. But if I open the file in Notepad++ I can see the unicode, which doesn't help with work (e.g. maybe I want to just copy and paste into Excel, where the unicode won't be converted). Emoticons I don't really need, and it doesn't sound like I really need unicode anyway (e.g. \" is probably easier to handle for text analysis than \u201c and \u201d) – user3120554 Oct 21 '17 at 04:57
  • What would you suggest I do to resolve this? It appears to work for the rest of my code (e.g. I'm able to replace \u201d); it's just that the last line doesn't seem to work (I get that you're saying the unicode represents one character, but that doesn't explain why the rest of my code seems to work despite that) – user3120554 Oct 21 '17 at 05:00
  • The link I posted searches for \u and 4 characters (within a list), why is my case different? https://stackoverflow.com/questions/13729638/how-can-i-filter-emoji-characters-from-my-input-so-i-can-save-in-mysql-5-5/13752628#13752628 – user3120554 Oct 21 '17 at 05:01
  • My guess is that the OP actually has ASCII text containing Unicode escape sequences, so doing stuff like `s.replace('\\u201c', '"')` will work. – PM 2Ring Oct 21 '17 at 05:13
  • @PM2Ring would you be able to give some advice on why my last line doesn't work while everything else (apparently) does? I still see unicode even after trying to remove \\u\w\w\w\w – user3120554 Oct 21 '17 at 05:25
  • I noticed @James's edit, and '[^{}]'.format(string.printable) seems to work! I searched for \u in Notepad++ and found no traces! What does this do? It works so well I'm wondering if it is deleting anything else? – user3120554 Oct 21 '17 at 05:30
  • After printing ('[^{}]'.format(string.printable)) to the console I get this, and I don't quite understand what it does. You're searching for something not in that list? [^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ] – user3120554 Oct 21 '17 at 05:43
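
On the point raised in the comments about seeing `\uXXXX` in the saved file: a small sketch, assuming the file is written with Python's standard `json` module (the `record` dictionary is made up for illustration). By default `json.dumps` writes non-ASCII characters as `\uXXXX` escapes, and an emoji outside the basic plane becomes a pair of them, even though the string in memory holds a single character. Passing `ensure_ascii=False` writes the real characters instead.

```python
import json

record = {"full_text": "nice day \U0001F600"}

# Default output is pure ASCII: the emoji is written as a surrogate pair.
escaped = json.dumps(record)
print(escaped)

# With ensure_ascii=False the character itself is written to the output.
readable = json.dumps(record, ensure_ascii=False)

# Reading the escaped form back gives the original single character again.
loaded = json.loads(escaped)
print(len("\U0001F600"), loaded == record)
```

So the escapes in the file are a property of how it was serialised, and removing characters from the in-memory strings before saving is what actually makes them disappear.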