import pandas as pd

def store_data(company, url, title, date, text):
    # One list per column; each call stores a single row.
    data = {'company_id': [], 'title': [], 'date': [], 'link': [], 'main_headline': [], 'main_headline_text': []}

    data['company_id'].append('example')
    data['title'].append(title)
    data['date'].append(date)
    data['link'].append(url)
    data['main_headline'].append(title)
    data['main_headline_text'].append(text)

    df = pd.DataFrame(data)
    # `lineterminator` was spelled `line_terminator` before pandas 1.5.
    df.to_csv(company + '10k' + '.csv', index=False, lineterminator='\n')
    return df

'text' is a string. When the DataFrame is written to CSV, the following change occurs.

original text : The move is in alignment with abc Group’s vision of “building a world-leading clean energy and chemical company” and its development pattern of “One Foundation, Two Wings and Three Growth Points”, creating synergy of financial investment with industry, and contributing to another big step of abc in new energy and new materials.

text in csv : The move is in alignment with abc Group’s vision of “building a world-leading clean energy and chemical company†and its development pattern of “One Foundation, Two Wings and Three Growth Pointsâ€, creating synergy of financial investment with industry, and contributing to another big step of abc in new energy and new materials.

Why is this happening? How can I avoid it so that the original text, even if it isn't in English, appears unchanged in the CSV?

Soumya Pandey
  • Looks like you're reading a UTF-8 file using a single-byte codepage, e.g. Latin1. There's nothing wrong with the file. How are you reading this file? The smart quotes in the original text aren't part of the ASCII range, so they're encoded using 2 or more bytes. – Panagiotis Kanavos Feb 02 '21 at 06:52
  • Does this answer your question? [Pandas df.to\_csv("file.csv" encode="utf-8") still gives trash characters for minus sign](https://stackoverflow.com/questions/25788037/pandas-df-to-csvfile-csv-encode-utf-8-still-gives-trash-characters-for-min) – Panagiotis Kanavos Feb 02 '21 at 06:55
  • @Panagiotis Kanavos text is a string that I scraped using BeautifulSoup get_text(). How can I get the original text in the CSV? – Soumya Pandey Feb 02 '21 at 06:55
  • That *is* the original text. Again, how are you reading it? There's nothing wrong with the file. The problem is the editor – Panagiotis Kanavos Feb 02 '21 at 06:56
  • I'm opening the CSV in Excel and looking at it there. – Soumya Pandey Feb 02 '21 at 06:56
  • Read the duplicate then. BTW you aren't just opening the file in Excel (from the Data menu). You're double-clicking on it. Unless the file has a BOM, Excel can't guess the encoding, so it *imports* the text using the encoding of the user's locale – Panagiotis Kanavos Feb 02 '21 at 06:57
  • Yes, I am. When I open the CSV in a text editor, it looks fine. When I open it in Excel by double-clicking on it, it shows the garbled version. What do you mean by 'read the duplicate'? – Soumya Pandey Feb 02 '21 at 07:00
  • @PanagiotisKanavos Oh! Alright, now I understand. So is there a way to add a BOM to the file? – Soumya Pandey Feb 02 '21 at 07:02
  • Read the duplicate – Panagiotis Kanavos Feb 02 '21 at 07:06
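
The mechanism described in the comments can be reproduced in a few lines. This is only a sketch and not part of the original thread: the sample string, the `demo.csv` file name, and the choice of cp1252 as the "user locale" encoding are assumptions for illustration. It uses only the standard library so the byte-level behaviour is visible.

```python
import codecs
import csv
import os
import tempfile

# Sample text with a typographic apostrophe (U+2019), as in the question.
original = "abc Group\u2019s vision"

# UTF-8 stores U+2019 as three bytes: E2 80 99.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with a single-byte codepage (cp1252 is assumed here
# as the locale encoding) turns each byte into its own character,
# so the apostrophe becomes the three characters "â€™".
garbled = utf8_bytes.decode("cp1252")

# Writing with the "utf-8-sig" codec prepends the UTF-8 BOM (EF BB BF),
# which lets a reader detect the encoding instead of guessing.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")  # hypothetical file name
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["main_headline_text"])
    writer.writerow([original])

# Check the raw bytes: the file should start with the BOM.
with open(path, "rb") as f:
    has_bom = f.read().startswith(codecs.BOM_UTF8)

# Reading back with the same codec strips the BOM and restores the text.
with open(path, encoding="utf-8-sig", newline="") as f:
    rows = list(csv.reader(f))
```

In pandas, `to_csv` takes the same codec name through its `encoding` parameter, so `df.to_csv(path, index=False, encoding='utf-8-sig')` should produce a file that Excel decodes correctly when double-clicked.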

0 Answers