I used to code in R, but have recently switched back to Python. For a research project about hate speech, I would like to display messages from Telegram channels with Telethon and store them in a dataframe. I need to store the data because I want to visualise and analyse it computationally. I am used to pandas dataframes, but am happy with alternatives too. I am using Python 3.7 with the Spyder IDE.

With this tutorial I can get and display the messages within a channel I am a member of.

from telethon.sync import TelegramClient

name = 'anon' 
api_id = 'myAPI_ID'
api_hash = "myAPI_hash" 
chat = 'chat_link'

async with TelegramClient(name, api_id, api_hash) as client:
    async for message in client.iter_messages(chat):
        print(message.sender_id, ':', message.text)

I thought I could just create a new variable to store the displayed data, but I have discovered that it is not that trivial, in part due to the coroutines. The line below creates a new variable, but I cannot work out how to store the data in a (pandas) dataframe. I am not even sure whether it stores the correct type of data.

participants = message.sender_id

While the Telethon documentation explains really nicely how to display messages, there is no example of how to store them. I am aware that the same question has been asked before, but without an answer. I have also looked at this tutorial that explains how to mine and store messages, but I cannot make it work. The first problem arises with the 5th code line [Telegram]. Even if I patch different lines of code together, the GetParticipantsRequest command does not work for channels where I am not an admin.

How can I store the displayed messages and user IDs in a dataframe?

Thanks for your help.

Simone

1 Answer

Your question is more about Python and Pandas than Telegram and Telethon, as far as I can understand.

from telethon.sync import TelegramClient

name = 'anon' 
api_id = 'myAPI_ID'
api_hash = "myAPI_hash" 
chat = 'chat_link'

async with TelegramClient(name, api_id, api_hash) as client:
    async for message in client.iter_messages(chat):
        print(message.sender_id, ':', message.text)

With this code, you are iterating over the messages of a Telegram chat and printing the sender's ID and the message text.

To store them in a variable, you just have to change

print(message.sender_id, ':', message.text)

to

sender, text = message.sender_id, message.text

You can append your data to a list and then save it to a pandas dataframe.

Putting it all together:

import pandas as pd
from telethon.sync import TelegramClient

name = 'anon' 
api_id = 'myAPI_ID' 
api_hash = 'myAPI_hash' 
chat = 'chat_link'

data = [] # stores all our data in the format SENDER_ID, MSG

async with TelegramClient(name, api_id, api_hash) as client:
    async for message in client.iter_messages(chat):
        data.append([message.sender_id, message.text])


df = pd.DataFrame(data, columns=['SENDER', 'MESSAGE']) # creates a new dataframe


df.to_csv('filename.csv', encoding='utf-8') # save to a CSV file
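Rows collected this way can include entries with missing or empty text (for example, posts that carry only media), which show up as empty rows in the output. A minimal sketch of one way to drop them before saving follows; the helper name `rows_to_frame` and the sample rows are hypothetical and not part of Telethon or pandas.

```python
import pandas as pd

def rows_to_frame(rows):
    """Build a DataFrame from (sender_id, text) pairs, dropping rows without text."""
    df = pd.DataFrame(rows, columns=['SENDER', 'MESSAGE'])
    # Treat missing (None) and whitespace-only texts the same way
    has_text = df['MESSAGE'].fillna('').str.strip() != ''
    return df[has_text].reset_index(drop=True)

# Made-up sample rows standing in for the collected `data` list
sample = [(-1001234, 'hello'), (42, None), (42, ''), (7, 'line one\nline two')]
clean = rows_to_frame(sample)
clean.to_csv('filename.csv', encoding='utf-8', index=False)
```

Passing `index=False` keeps the row index out of the file. pandas quotes fields that contain line breaks or commas, so a message that appears split across cells may be a matter of how the viewing program parses the CSV rather than of how it was written.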

Note: Take care of API limits when iterating messages in a chat.
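One way to stay within those limits is to cap how many messages are fetched per run with the `limit` argument of `iter_messages`. The sketch below is only an illustration: `collect_messages` is a hypothetical helper, the default of 500 is an arbitrary assumption, and `client` is expected to be an already connected `TelegramClient`.

```python
async def collect_messages(client, chat, limit=500):
    """Return [sender_id, text] rows for at most `limit` messages of `chat`."""
    rows = []
    # `limit` caps the number of messages requested from Telegram
    async for message in client.iter_messages(chat, limit=limit):
        rows.append([message.sender_id, message.text])
    return rows
```

Inside the `async with TelegramClient(...) as client:` block it could be used as `data = await collect_messages(client, chat, limit=200)`.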

Prashant Sengar
  • Thanks for sharing, Prashant. The code works to store the data. I need to wrangle with the formatting, as it does not always put the sender ID and message in separate cells. There are also many empty rows. – Simone Jun 09 '21 at 11:04
  • The minus in front of sender IDs causes a problem for pandas CSV writing. I used the command to write an Excel file instead. That seemed quicker and simpler than using a regex to delete the minus, though this solution may not work in future: I got a warning message. – Simone Jun 09 '21 at 11:45
  • What about persisting the whole `message` object? Trying something like `data.append(message)` and then `pickle` `data` doesn't work... – Dror Jun 14 '22 at 12:56
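On persisting whole messages: pickling probably fails because each `message` keeps a reference to the live client. A possible alternative, sketched here under that assumption, is to convert each message with its `to_dict()` method and store the result as JSON. `messages_to_json` is a hypothetical helper, and `default=str` is a simple (lossy) way to serialise dates and bytes.

```python
import json

def messages_to_json(messages):
    """Serialise objects that expose to_dict() (as Telethon messages do) to JSON."""
    records = [m.to_dict() for m in messages]
    # default=str turns datetimes, bytes and other non-JSON values into strings
    return json.dumps(records, default=str, ensure_ascii=False)
```

With `data.append(message)` in the loop, `messages_to_json(data)` returns a string that can be written to a file and later read back with `json.loads`, although the result is plain dictionaries rather than Telethon objects.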