-1

I have a large list of approximately 47234 English sentences, some of which contain emojis. I will use this list to build a chatbot, and I want to know which file format (e.g. txt, csv, or something else) I should use to store it. Because of the emojis, I am not sure which format will let me easily retrieve the list later. What should I do?

Here is some content of my list:

['hi there', 'Hello!', 'Hi! How are you?', 'Not bad! And You?',
 "I'm doing well. Just got engaged to my high school sweetheart.",
 'Wowowowow! Congratulations! Is she pretty?', 
"She 's pretty cute. She invited me to dinner tonight. ",
 'Cool! Have a good time you both! And what is your hobby?',
 'I love music! I love Taylor swift. ']

I have tried this:

with open("file.txt", 'w') as output:
    for sentence in sentences:
        output.write(str(sentence) + '\n')

This code gives the following error message:

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f642' in position 54: character maps to <undefined>

It seems that this error message is due to emojis.
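
A minimal sketch of why the emoji might trigger this error, independent of any file. It assumes the platform's default encoding is cp1252 (common on Windows, but not confirmed here); the variable names are only illustrative.

```python
# U+1F642 is the character named in the error message above.
emoji = "\U0001f642"

# Assumption: cp1252 stands in for the platform's default "charmap" encoding.
try:
    emoji.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can represent every Unicode code point, so this call succeeds.
utf8_bytes = emoji.encode("utf-8")

print(cp1252_ok)        # False
print(len(utf8_bytes))  # 4
```

If this reproduces the failure, the problem is the encoding used for the file rather than the loop itself.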

As I mentioned, the list has 47324 items, so I don't think using a for loop is a feasible solution.

Any help would be really appreciated.

wjandrea
AS2
  • 47324 is very small... What is the problem with what you currently do? – Thierry Lathuille Apr 20 '20 at 17:48
  • @ThierryLathuille Which file format is better in my case if I want to retrieve the list exactly as I stored it? – AS2 Apr 20 '20 at 17:49
  • @AS2 I would say JSON is likely the easiest – jordanm Apr 20 '20 at 17:50
  • @jordanm But the structure of my data is very simple; it is just text messages – AS2 Apr 20 '20 at 17:52
  • This question needs more focus and detail. *Exactly* why is the for loop a problem? – nicomp Apr 20 '20 at 17:53
  • How did your attempt to store it as a plain text file fail? – Jongware Apr 20 '20 at 17:53
  • If none of the strings contain newlines, carriage returns, or other control characters (which they shouldn't), the current method should work fine. JSON or other formats would work of course, but would add a bit of overhead. What's the problem with your current method? – wjandrea Apr 20 '20 at 17:54
  • @wjandrea I have edited my post and included the error message – AS2 Apr 20 '20 at 17:59
  • Related? [UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function](https://stackoverflow.com/q/14630288/4518341), [Python, Unicode, and the Windows console](https://stackoverflow.com/q/5419/4518341) – wjandrea Apr 20 '20 at 18:06

2 Answers

2

Using the json module is the simplest way, since it restores the list exactly as it was stored:

import json

data = ['hi there', 'Hello!', 'Hi! How are you?', 'Not bad! And You?',
 "I'm doing well. Just got engaged to my high school sweetheart.",
 'Wowowowow! Congratulations! Is she pretty?', 
"She 's pretty cute. She invited me to dinner tonight. ",
 'Cool! Have a good time you both! And what is your hobby?',
 'I love music! I love Taylor swift. ']

# writing the data to file
with open('data.json', 'w') as f:
    json.dump(data, f)

# reading the data
with open('data.json') as f:
    read_data = json.load(f)

print(read_data)
# ['hi there', 'Hello!', 'Hi! How are you?', 'Not bad! And You?', 
#  "I'm doing well. Just got engaged to my high school sweetheart.", 
#  'Wowowowow! Congratulations! Is she pretty?',
#  "She 's pretty cute. She invited me to dinner tonight. ", 
#  'Cool! Have a good time you both! And what is your hobby?', 
#  'I love music! I love Taylor swift. ']
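
By default, json.dump escapes non-ASCII characters as \uXXXX sequences (ensure_ascii=True), so the file holds only ASCII and the emojis survive the round trip. If you would rather see the emojis themselves in the file, a variant along these lines might work. It is a sketch only: the sample list and the temporary path are hypothetical, not taken from the question.

```python
import json
import os
import tempfile

# Hypothetical sample; the last string contains the emoji from the traceback.
data = ["hi there", "Hello!", "Not bad! And You? \U0001f642"]

# A temporary directory keeps the sketch from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.json")

    # ensure_ascii=False writes the emoji itself instead of a \uXXXX escape,
    # so the file must be opened with an encoding that can represent it.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)

    # Read the file back with the same encoding.
    with open(path, encoding="utf-8") as f:
        read_data = json.load(f)

print(read_data == data)  # True
```

Passing encoding="utf-8" on both open() calls avoids depending on the platform's default encoding.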
Thierry Lathuille
1

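I've modified your code to pass encoding="utf-8" when opening file.txt for writing. Without it, open() uses the platform's default encoding, which on Windows is often a 'charmap' code page such as cp1252 that cannot represent emojis: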

sentences = ['hi there', 'Hello!', 'Hi! How are you?', 'Not bad! And You?',
 "I'm doing well. Just got engaged to my high school sweetheart.",
 'Wowowowow! Congratulations! Is she pretty?', "She 's pretty cute. She invited me to dinner tonight. ",
 'Cool! Have a good time you both! And what is your hobby?', 'I love music! I love Taylor swift. ']

with open("file.txt", 'w', encoding="utf-8") as output:
    for sentence in sentences:
        output.write(str(sentence) + '\n')
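
To get the list back, the file can be read with the same encoding. The following is a minimal sketch, assuming no sentence contains a newline character; the sample list, the temporary path, and the name restored are hypothetical.

```python
import os
import tempfile

# Hypothetical sample; the last string contains the emoji from the traceback.
sentences = ["hi there", "Hello!", "Not bad! And You? \U0001f642"]

# A temporary directory keeps the sketch from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.txt")

    # Write one sentence per line, as in the code above.
    with open(path, "w", encoding="utf-8") as output:
        for sentence in sentences:
            output.write(sentence + "\n")

    # Read the lines back, stripping only the trailing newline so that
    # trailing spaces inside a sentence are preserved.
    with open(path, encoding="utf-8") as source:
        restored = [line.rstrip("\n") for line in source]

print(restored == sentences)  # True
```

Using rstrip("\n") rather than strip() matters here because some of the original sentences end with a space.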
MShakeG