
I have an input file in which one row contains several mu (μ) characters. My Python code opens the file, does some manipulation, and saves the result in .csv format. When I save the file as .csv, the output contains replacement characters (�) where the μ characters should be. The attached images show the input and output files as they appear when opened in Excel.

Input CSV file:

InputCSVFILE

Output CSV file:

OutputCSVFILE

from pathlib import Path
import pandas as pd
import time
import argparse

parser = argparse.ArgumentParser(description='Process some integers.')

parser.add_argument('path',
                    help='define the directory to folder/file')

start = time.time()

def main(path_files):
    rs_columns = "SourceFile,RowNum,SampleID,Method,Element,Result".split(",")
    rs = pd.DataFrame(columns=rs_columns)
    if path_files.is_file():
        fnames = [path_files]
    else:
        fnames = list(Path(path_files).glob("*.csv"))
        
    for fn in fnames:
        if "csv" in str(fn):
            #df = pd.read_csv(str(fn))
            df = pd.read_csv(str(fn), header=None, sep='\n')
            df = df[0].str.split(',', expand=True)
        else:
            print("Unknown file", str(fn))
            
        non_null_columns = [col for col in df.columns if df.loc[:, col].notna().any()]    
        
        # loop thru each column for the whole file and create a row of results in the output file
        for i in range(1,len(non_null_columns)):
            SourceFile = Path(fn.name)
            Method = "WetScreening"
            Element = df.iloc[1,i]
            print(Element)
            for j in range(2,len(df)):
                RowNum = j+1
                Result = df.iloc[j,i]
                SampleID = df.iloc[j,0]                
                rs = rs.append(pd.DataFrame({
                    "SourceFile": [SourceFile],
                    "RowNum": [RowNum],
                    "SampleID": [SampleID],
                    "Method": [Method],
                    "Element": [Element],                                        
                    "Result": [Result]
                }),ignore_index=True)
    rs.to_csv("check.csv",index=False)
    print("Output: check.csv")
  
if __name__== "__main__":
    start = time.time()
    args = parser.parse_args()
    path = Path(args.path)
    main(path)
    print("Processed time: ", time.time()-start)

Attach files here

Any help would be appreciated.

karel

1 Answer


Try encoding to utf-8:

rs.to_csv("check.csv",index=False, encoding='UTF-8')

See also "Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign". That answer mentions the BOM bytes (0xEF, 0xBB, 0xBF) at the start of the file, which act as a UTF-8 signature. Writing with the 'utf-8-sig' codec adds those bytes:

rs.to_csv("check.csv", index=False, encoding='utf-8-sig')
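
To see what the two codec names actually put on disk, here is a minimal sketch using only the standard library. The file names and the sample row are made up for illustration. pandas accepts the same codec names in its encoding argument, so the bytes should differ in the same way.

```python
import codecs
import tempfile
from pathlib import Path

# Made-up sample content containing a Greek mu, similar to the question's data.
text = "Element,Result\nCu (\u03bcg/L),1.5\n"

with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "plain.csv"
    signed = Path(tmp) / "signed.csv"

    # Same text, two codecs: plain UTF-8 and UTF-8 with a signature (BOM).
    plain.write_text(text, encoding="utf-8")
    signed.write_text(text, encoding="utf-8-sig")

    plain_bytes = plain.read_bytes()
    signed_bytes = signed.read_bytes()

# codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
has_bom_plain = plain_bytes.startswith(codecs.BOM_UTF8)
has_bom_signed = signed_bytes.startswith(codecs.BOM_UTF8)
print(has_bom_plain, has_bom_signed)
```

Apart from the three leading bytes, the two files are identical. The signature only helps a program recognize the file as UTF-8; it cannot repair characters that were already decoded incorrectly when the input was read.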
Tarik
  • This method is not working. I already tried it. – Lowell Lowellcraft Sep 25 '21 at 03:53
  • When you open the file in notepad, specify utf-8 encoding. – Tarik Sep 25 '21 at 03:55
  • When you open the input file, what encoding do you use? Same question goes for reading CSV. You need to keep consistent encoding. – Tarik Sep 25 '21 at 03:58
  • It is not working. When I tried rd.to_csv('file.csv', index=False, encoding='utf-8-sig') and opened the result in Excel, it still showed weird characters. – Lowell Lowellcraft Sep 25 '21 at 04:00
  • The question is what is the original encoding of your file. If you open your file using notepad, what encoding are you using? – Tarik Sep 25 '21 at 04:04
  • 1
    You are not showing the output csv file. You show what appears when you read it into Excel. That is often not the same thing. To see what is *really* in the csv open it in a text editor with proper encoding support like Notepad++ which is free. – BoarGules Sep 25 '21 at 07:44
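
The checks suggested in the comments (what encoding the input really uses, and what bytes are really in the output) can also be done from Python. The sketch below is only an illustration: guess_encoding and already_damaged are hypothetical helpers, and the candidate list assumes the input is either UTF-8 or the Windows code page cp1252, which may not match the actual file.

```python
import codecs

# U+FFFD (the "�" character) encoded as UTF-8: the bytes EF BF BD.
REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")


def guess_encoding(raw, candidates=("utf-8", "cp1252")):
    """Hypothetical helper: return the first candidate codec that decodes raw.

    This is a guess, not proof: cp1252 accepts almost any byte sequence.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


def already_damaged(raw):
    """True if the file itself contains U+FFFD, so the data was lost before Excel."""
    return REPLACEMENT_UTF8 in raw


# Illustrative byte strings; in practice use Path("your_file.csv").read_bytes().
legacy = "\u00b5g/L".encode("cp1252")  # micro sign stored as the single byte B5
modern = "\u03bcg/L".encode("utf-8")   # Greek mu stored as the bytes CE BC
print(guess_encoding(legacy), guess_encoding(modern))
print(already_damaged("\ufffdg/L".encode("utf-8")))
```

If the input turns out not to be UTF-8, passing the detected name as encoding= to pd.read_csv keeps reading and writing consistent. If already_damaged reports True for check.csv, the characters were replaced during processing rather than by Excel.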