I have Python code that behaves differently when I run it on Windows and when I run it on CentOS. Below is the part of the code that is relevant to this issue, with comments explaining its purpose. It processes a set of CSV files (some with columns different from each other) and merges them into a single CSV that contains all the columns:
from glob import glob
import csv
import json
import pandas as pd

#Get the names of the CSV files in the current folder:
local_csv_files = glob("*.csv")
#Define the columns and the order in which they should appear in the final file:
global_csv_columns = ['Timestamp', 'a_country', 'b_country', 'call_setup_time','quality','latency','throughput','test_type']
#Dataframe list:
lista_de_dataframes = []
#Loop executed for all the CSV files in the current folder.
for ficheiro_csv in local_csv_files:
    df = pd.read_csv(ficheiro_csv)
    #Store the CSV columns in a variable and collect the number of columns:
    colunas_do_csv_aux = df.columns.values
    global_number_of_columns = len(global_csv_columns)
    aux_csv_number_of_columns = len(colunas_do_csv_aux)
    #Normalize each CSV file so that all CSV files have the same columns:
    for coluna_ in global_csv_columns:
        if not search_column(colunas_do_csv_aux, coluna_):
            #If the column does not exist in the current CSV, add an empty column with the correct header:
            df.insert(0, coluna_, "")
    #Order the dataframe columns according to the order of the global_csv_columns list:
    df = df[global_csv_columns]
    lista_de_dataframes.append(df)
    del df
big_unified_dataframe = pd.concat(lista_de_dataframes, copy=False).drop_duplicates().reset_index(drop=True)
big_unified_dataframe.to_csv('global_file.csv', index=False)
#Create an additional txt file presenting each row of the CSV in JSON format:
with open('global_file.csv', 'r') as arquivo_csv:
    with open('global_file_c.txt', 'w') as arquivo_txt:
        reader = csv.DictReader(arquivo_csv, global_csv_columns)
        iterreader = iter(reader)
        #Skip the header row (the fieldnames were passed explicitly above):
        next(iterreader)
        for row in iterreader:
            out = json.dumps(row)
            arquivo_txt.write(out + '\n')
Now, on both Windows and CentOS, this works well for the final CSV, which has all the columns ordered as defined in the list:
global_csv_columns = ['Timestamp', 'a_country', 'b_country', 'call_setup_time','quality','latency','throughput','test_type']
This ordering is achieved by this code line:
#Order the dataframe columns according to the order of the global_csv_columns list:
df = df[global_csv_columns]
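As a minimal, self-contained illustration of that reindexing step (the column names below are just a subset of the real list), selecting a DataFrame with a list of labels returns the columns in the order of the list:

```python
import pandas as pd

# Toy frame with columns in an arbitrary order:
df = pd.DataFrame({"b_country": ["UAE"],
                   "Timestamp": ["06/09/2022 10:33"],
                   "a_country": ["UAE"]})

# Selecting with a list of labels yields a new frame whose
# columns follow the list's order, not the original order.
ordered = df[["Timestamp", "a_country", "b_country"]]
print(list(ordered.columns))  # ['Timestamp', 'a_country', 'b_country']
```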
But the final 'txt' file is different on CentOS: the key order changes. Below is the output of the txt file on both platforms (Windows and CentOS).
Windows:
{"Timestamp": "06/09/2022 10:33", "a_country": "UAE", "b_country": "UAE", "call_setup_time": "7.847", "quality": "", "latency": "", "throughput": "", "test_type": "voice_call"}
{"Timestamp": "06/09/2022 10:30", "a_country": "Saudi_Arabia", "b_country": "Saudi_Arabia", "call_setup_time": "10.038", "quality": "", "latency": "", "throughput": "", "test_type": "voice_call"}
...
CentOS:
{"latency": "", "call_setup_time": "7.847", "Timestamp": "06/09/2022 10:33", "test_type": "voice_call", "throughput": "", "b_country": "UAE", "a_country": "UAE", "quality": ""}
{"latency": "", "call_setup_time": "10.038", "Timestamp": "06/09/2022 10:30", "test_type": "voice_call", "throughput": "", "b_country": "Saudi_Arabia", "a_country": "Saudi_Arabia", "quality": ""}
...
Is there any way to ensure the column order on CentOS?
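For reference, one likely cause (an assumption, since the Python versions are not stated) is that the CentOS machine runs an older Python: before CPython 3.7, plain dicts did not guarantee insertion order, and on Python 2.7 `csv.DictReader` yields an unordered `dict`, so `json.dumps(row)` may emit keys in any order. A possible version-independent workaround is to rebuild each row in the fixed column order before serializing; the hypothetical helper `row_to_json` below sketches this:

```python
import json
from collections import OrderedDict

global_csv_columns = ['Timestamp', 'a_country', 'b_country', 'call_setup_time',
                      'quality', 'latency', 'throughput', 'test_type']

def row_to_json(row, columns):
    # Rebuild the row in the fixed column order; OrderedDict preserves
    # insertion order on every Python version, and json.dumps keeps
    # that order unless sort_keys=True is passed.
    ordered_row = OrderedDict((col, row.get(col, "")) for col in columns)
    return json.dumps(ordered_row)

# Example with one in-memory row, using values from the CentOS output above:
row = {"latency": "", "call_setup_time": "7.847", "Timestamp": "06/09/2022 10:33",
       "test_type": "voice_call", "throughput": "", "b_country": "UAE",
       "a_country": "UAE", "quality": ""}
print(row_to_json(row, global_csv_columns))
```

Inside the loop this would replace `out = json.dumps(row)` with `out = row_to_json(row, global_csv_columns)`.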