Compare two CSV files and search for similar emoji

Question

Let's say I have two CSV files each containing different emojis.

The first CSV contains a list of all emojis with their corresponding Unicode code points (shortened here for the sake of the question):

Emojis     Unicode
         1F600
         1F603
         1F604
         1F601
         1F606
         1F605
         1F525
✨         2728

The second CSV contains a shorter list of random emojis drawn from Twitter posts:

Emojis     Freq.     
          45
           3
          93
          39
✨          35

I am trying to figure out a solution that adds a new column to the SECOND CSV file, containing the Unicode code point that corresponds to the emoji in each row. The end result would look something like this:

Emojis     Freq.     Unicode     
          45       1F600
           3        1F525
          93       1F603
          39       1F601
✨          35       2728

The closest question that I could find was here but it did not work in my case...

I am using Python 3.9

Emojis are just text, you can use them as dictionary keys – Boris Verkhovskiy Dec 14 '20 at 11:59

snakecharmerb · Answer 1 · 2020-12-14T11:53:12.467

2

You could read the second csv into a single dict, then filter the first based on that dict.

import csv

with open('freqs.csv', newline='') as f:
    reader = csv.reader(f)
    # Skip the header row, if there is one
    next(reader)
    freqs = {emoji: frequency for (emoji, frequency) in reader}


with open('emoji.csv', newline='') as f:
    reader = csv.reader(f)
    # skip the header
    next(reader)
    # Keep only the rows whose emoji also appears in the frequency file
    filtered_rows = [row for row in reader if row[0] in freqs]

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Emojis', 'freq', 'unicode'])
    for row in filtered_rows:
        emoji = row[0]
        row.insert(1, freqs[emoji])
        writer.writerow(row)

edited Dec 14 '20 at 11:53

answered Dec 14 '20 at 08:06


Upvoted and stolen, with adaptations, into my answer. – tripleee Dec 14 '20 at 12:52
@tripleee No problem :-) I've learned a lot from your unicode answers. – snakecharmerb Dec 14 '20 at 13:02
@snakecharmerb I tried your solution but nothing was getting written on 'output.csv'. Checked to see what was wrong, and apparently, the filtered_rows list is empty. Not sure if that is supposed to happen. – Kosu K. Dec 14 '20 at 19:46
@KosuK. the code works for me, using the data in the question. Things to check might include: the file structure is the same as in the question (no extra columns or differences in order); there is no leading or trailing whitespace in the values; as tripleee observes, you may need to normalise the emoji to ensure everything matches up. Also print out the freqs dictionary; if that looks good, then there must be a mismatch between its keys and the values in the emoji csv. – snakecharmerb Dec 14 '20 at 20:03
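
The checks suggested in the comment above could be sketched roughly as follows. This is only a minimal illustration: the sample values and the helper names `describe` and `clean` are hypothetical and not part of the original answer, and NFC is just one possible choice of normalization form.

```python
import unicodedata

def describe(text):
    """Return the code points of text as uppercase hex strings."""
    return [f"{ord(ch):04X}" for ch in text]

def clean(text):
    """Strip surrounding whitespace and apply NFC normalization."""
    return unicodedata.normalize("NFC", text.strip())

# Hypothetical sample data: the first key carries a trailing space,
# which is invisible when printed but breaks an exact dict lookup.
freq_keys = ["\U0001F525 ", "\u2728"]
emoji_keys = {"\U0001F525", "\u2728"}

for key in freq_keys:
    if key not in emoji_keys:
        # repr() and the code point list reveal hidden characters
        print(repr(key), describe(key),
              "matches after cleaning:", clean(key) in emoji_keys)
```

Printing the code points of a key that fails to match usually shows at a glance whether the culprit is whitespace, an extra variation selector, or a surrogate pair.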

tripleee · Answer 2 · 2020-12-14T12:57:48.780

A complication is that e.g. Tweepy typically provides you with emojis in a weird and in fact invalid surrogate format (in that UTF-8 explicitly forbids the use of surrogates, which are really only meant as a compatibility hack for UTF-16). You will need to perform Unicode normalization to properly compare two Unicode strings, and on top of that, handle surrogates if they are present in your input.
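
To illustrate the surrogate problem in isolation, here is a small sketch. The sample strings are made up for the demonstration; the round trip through UTF-16 with the `surrogatepass` error handler is the same trick the linked answer relies on.

```python
from unicodedata import normalize

# The same emoji (U+1F600) written as two lone surrogate code units,
# as some APIs deliver it, and as a single proper code point.
surrogate_form = "\ud83d\ude00"
proper_form = "\U0001F600"

# Round-tripping through UTF-16 merges the surrogate pair.
repaired = surrogate_form.encode("utf-16", "surrogatepass").decode("utf-16")

print(surrogate_form == proper_form)       # the raw strings differ
print(repaired == proper_form)             # equal once the pair is merged
print(len(surrogate_form), len(repaired))  # two code units vs. one code point
print(normalize("NFKD", repaired) == proper_form)
```

Without the repair step, a dictionary lookup with the surrogate form as the key would silently miss the entry stored under the proper form.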

Here's an adaptation of snakecharmerb's answer with this addition, and also I changed the second loop to just manipulate one row at a time.

from unicodedata import normalize
import csv

def normalize_with_surrogate(s):
    "See https://stackoverflow.com/a/54549164/874188"
    return normalize('NFKD', s.encode('utf-16', 'surrogatepass').decode('utf-16'))

with open('freqs.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)
    freqs = {normalize_with_surrogate(emoji): frequency for (emoji, frequency) in reader}

with open('emoji.csv', newline='') as f, open('output.csv', 'w', newline='') as o:
    reader = csv.reader(f)
    writer = csv.writer(o)
    next(reader)
    # The world will be a more beautiful place without this
    writer.writerow(['Emojis', 'freq', 'unicode'])

    for row in reader:
        emoji = normalize_with_surrogate(row[0])
        if emoji in freqs:
            row.insert(1, freqs[emoji])
            writer.writerow(row)

Of course, if the Emojis file really simply contains the code point for each emoji, you don't need that file at all; simply print '%05X' % ord(emoji) (the file seems to be wrong, too; the first one in your example is actually U+1F603).
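
A minimal sketch of that idea, assuming the input has already been normalized as above. The helper name `code_points` and the sample rows are hypothetical; since `ord()` only accepts a single character, the helper loops over the string so that emoji made of several code points (for example with a skin tone modifier) do not raise an error.

```python
def code_points(emoji):
    """Return the code points of emoji as space-separated uppercase hex."""
    return " ".join(f"{ord(ch):X}" for ch in emoji)

# Hypothetical rows as they might come out of the frequency file
rows = [["\U0001F525", "3"], ["\u2728", "35"], ["\U0001F44D\U0001F3FD", "7"]]

for emoji, freq in rows:
    print(emoji, freq, code_points(emoji))
```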

score 0 · Answer 3 · answered Dec 14 '20 at 02:33

You can use pandas to do a left join (merge) to return the matching unicode values.

import pandas as pd
df_all = pd.read_csv('all_emojis.csv')
df_twitter = pd.read_csv('twitter_emojis.csv')

output = df_twitter.merge(df_all, on='Emojis', how='left')

# To write to csv
output.to_csv('twitter_emojis_with_unicode.csv', index=False)
