0

I have Python code that behaves differently on Windows and on CentOS. Below is the relevant part of the code, with comments explaining its purpose. It processes a set of CSV files (some of which have different columns from the others) and merges them into a single CSV that has all the columns:

import csv
import json
from glob import glob

import pandas as pd

# Get the names of the CSV files in the current folder:
local_csv_files = glob("*.csv")
# Define the columns and the order in which they should appear in the final file:
global_csv_columns = ['Timestamp', 'a_country', 'b_country', 'call_setup_time', 'quality', 'latency', 'throughput', 'test_type']
# List of dataframes:
lista_de_dataframes = []

# Loop over all the CSV files in the current folder.
for ficheiro_csv in local_csv_files:
    df = pd.read_csv(ficheiro_csv)
    # Store the CSV columns in a variable and collect the number of columns:
    colunas_do_csv_aux = df.columns.values
    global_number_of_columns = len(global_csv_columns)
    aux_csv_number_of_columns = len(colunas_do_csv_aux)
    # Normalize each CSV file so that all CSV files have the same columns
    for coluna_ in global_csv_columns:
        # search_column is defined elsewhere in the script (not shown here)
        if not search_column(colunas_do_csv_aux, coluna_):
            # If the column does not exist in the current CSV, add an empty column with the correct header:
            df.insert(0, coluna_, "")
    # Order the dataframe columns according to the order of the global_csv_columns list:
    df = df[global_csv_columns]
    lista_de_dataframes.append(df)
    del df
big_unified_dataframe = pd.concat(lista_de_dataframes, copy=False).drop_duplicates().reset_index(drop=True)
big_unified_dataframe.to_csv('global_file.csv', index=False)

# Create an additional txt file that presents each row of the CSV in JSON format:
with open('global_file.csv', 'r') as arquivo_csv:
   with open('global_file_c.txt', 'w') as arquivo_txt:
      reader = csv.DictReader(arquivo_csv, global_csv_columns)
      iterreader = iter(reader)
      next(iterreader)
      for row in iterreader:
         out=json.dumps(row)
         arquivo_txt.write(out)

On both Windows and CentOS, this works well for the final CSV, which has all the columns ordered as defined in the list:

global_csv_columns = ['Timestamp', 'a_country', 'b_country', 'call_setup_time','quality','latency','throughput','test_type']

This ordering is achieved by this code line:

#Order the dataframe columns according to the order of the global_csv_columns list:
    df = df[global_csv_columns]

But the final txt file is different on CentOS, where the order of the keys is changed. Below is the output of the txt file on both platforms (Windows and CentOS).

Windows:

{"Timestamp": "06/09/2022 10:33", "a_country": "UAE", "b_country": "UAE", "call_setup_time": "7.847", "quality": "", "latency": "", "throughput": "", "test_type": "voice_call"}
{"Timestamp": "06/09/2022 10:30", "a_country": "Saudi_Arabia", "b_country": "Saudi_Arabia", "call_setup_time": "10.038", "quality": "", "latency": "", "throughput": "", "test_type": "voice_call"}
...

CentOS:

{"latency": "", "call_setup_time": "7.847", "Timestamp": "06/09/2022 10:33", "test_type": "voice_call", "throughput": "", "b_country": "UAE", "a_country": "UAE", "quality": ""}
{"latency": "", "call_setup_time": "10.038", "Timestamp": "06/09/2022 10:30", "test_type": "voice_call", "throughput": "", "b_country": "Saudi_Arabia", "a_country": "Saudi_Arabia", "quality": ""}
...

Is there any way to ensure the column order on CentOS?

rcmv
  • What version of python have you installed in Windows and in CentOS (you might check that by doing `python --version`)? – Daweo Sep 12 '22 at 11:41
  • On CentOS I’m running: Python 2.7.18 On Windows I’m running: Python 3.9.6 I tried to install a recent version on CentOS but wasn’t able to. If you know which command/version/repository I should use to install a similar version on CentOS please let me know. – rcmv Sep 12 '22 at 12:07

3 Answers

1

On CentOS I’m running: Python 2.7.18 On Windows I’m running: Python 3.9.6

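Now the reason is clear: regular dicts started preserving insertion order in CPython 3.6 as an implementation detail, and the language has guaranteed that behavior since Python 3.7. On Python 2.7, csv.DictReader returns plain dicts that do not preserve order, so json.dumps writes the keys in an arbitrary order.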

Read "Are dictionaries ordered in Python 3.6+?" if you want to know more.
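If upgrading is not an option, one possible workaround is to rebuild each row as a collections.OrderedDict in the desired column order before serializing it, since json.dumps writes an OrderedDict's keys in insertion order. The following is only a sketch under assumptions: row_to_json is a hypothetical helper name, and the in-memory sample stands in for global_file.csv. The demo part is written for Python 3; the helper itself relies only on OrderedDict and json.dumps, which are also available on Python 2.7.

```python
import csv
import io
import json
from collections import OrderedDict

# Column order taken from the question.
global_csv_columns = ['Timestamp', 'a_country', 'b_country', 'call_setup_time',
                      'quality', 'latency', 'throughput', 'test_type']


def row_to_json(row, columns=global_csv_columns):
    """Serialize one DictReader row with its keys in the order given by `columns`.

    Hypothetical helper, not part of the original script.
    """
    ordered = OrderedDict((col, row[col]) for col in columns)
    return json.dumps(ordered)


# Hypothetical in-memory stand-in for global_file.csv (header plus one data row).
sample = io.StringIO(
    "Timestamp,a_country,b_country,call_setup_time,quality,latency,throughput,test_type\n"
    "06/09/2022 10:33,UAE,UAE,7.847,,,,voice_call\n"
)
reader = csv.DictReader(sample)  # the header row supplies the field names
lines = [row_to_json(row) for row in reader]
```

In the original loop this would amount to replacing json.dumps(row) with row_to_json(row).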

If you know which command/version/repository I should use to install a similar version on CentOS please let me know.

The best solution is to run the same Python version on both machines, at least down to the minor version: if you have 3.9.6 on your Windows machine, install Python 3.9 on CentOS. If you are unable to install it, Python 3.7 or 3.8 should do. Be warned, however, that if both Python 2 and Python 3 are installed on the same machine, you must invoke python3 explicitly to use the newer version, e.g.

python3 helloworld.py

where helloworld.py is a file containing Python code.

Daweo
  • I've installed python3.7. Now, when I execute the script with python3 command, I'm still getting some columns out of order and I'm using the flag sort_keys: out=json.dumps(row, sort_keys=True). The order I get now is: {"Timestamp", "a_country", "b_country", "call_setup_time", "latency", "quality", "test_type", "throughput"} – rcmv Sep 12 '22 at 13:53
  • wait! After I removed the flag sort_keys it worked! Tkz :D – rcmv Sep 12 '22 at 13:56
0

Try the pd.DataFrame.to_json method, which writes a dataframe to a JSON file directly, so you do not have to read it back from a CSV file first. I suspect it will let you write the rows without changing the order of the columns.
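A minimal sketch of that idea, assuming a small hand-built dataframe as a stand-in for big_unified_dataframe (the sample row is taken from the question). With orient="records" and lines=True, to_json emits one JSON object per line, with keys following the dataframe's column order:

```python
import json
from collections import OrderedDict

import pandas as pd

# Column order taken from the question.
global_csv_columns = ['Timestamp', 'a_country', 'b_country', 'call_setup_time',
                      'quality', 'latency', 'throughput', 'test_type']

# Hypothetical stand-in for big_unified_dataframe.
df = pd.DataFrame(
    [{'Timestamp': '06/09/2022 10:33', 'a_country': 'UAE', 'b_country': 'UAE',
      'call_setup_time': '7.847', 'quality': '', 'latency': '',
      'throughput': '', 'test_type': 'voice_call'}],
    columns=global_csv_columns,
)

# With no path given, to_json returns the text; pass a file name
# (e.g. 'global_file_c.txt') as the first argument to write it to disk instead.
json_lines = df.to_json(orient="records", lines=True)

# Parse the lines back, keeping key order, to inspect what was written.
records = [json.loads(line, object_pairs_hook=OrderedDict)
           for line in json_lines.splitlines() if line]
```

The raw text may not be byte-for-byte identical to what json.dumps produces (separators and escaping can differ), which matters if later find/replace steps depend on the exact formatting.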

Ryo Suzuki
  • The thing is I'm making some 'find/replace' operations on txt file which are not described on the code above (didn't think it was relevant for the issue). Not sure if they would work on the json file. But, can you give me an example on how to apply that function on my code? – rcmv Sep 12 '22 at 12:09
0

The keys in your output JSON dictionaries aren't sorted, so the order in which they appear is not guaranteed. I think in practice the keys usually appear in the order in which they were added to each dictionary, but you can have each dictionary sorted by key:

out=json.dumps(row, sort_keys=True)

This will at least make the output consistent across platforms, although the alphabetical order may not match the order that is meaningful to you.
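A short illustration of what sort_keys does, using the first row from the question as sample data. The keys come out in code-point order, so the capitalized "Timestamp" sorts first and "latency" moves ahead of "quality", which is not the column order defined in global_csv_columns:

```python
import json
from collections import OrderedDict

# First row from the question, in the desired column order.
row = OrderedDict([
    ('Timestamp', '06/09/2022 10:33'), ('a_country', 'UAE'), ('b_country', 'UAE'),
    ('call_setup_time', '7.847'), ('quality', ''), ('latency', ''),
    ('throughput', ''), ('test_type', 'voice_call'),
])

as_is = json.dumps(row)                        # keeps the OrderedDict's order
sorted_out = json.dumps(row, sort_keys=True)   # keys sorted alphabetically

keys_as_is = list(json.loads(as_is, object_pairs_hook=OrderedDict))
keys_sorted = list(json.loads(sorted_out, object_pairs_hook=OrderedDict))
```

So sort_keys gives the same output on every platform, but it only helps if alphabetical order is acceptable.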

John
  • It changed the order but I'm still getting different results. Now, on CentOS, I'm getting: {"Timestamp": "05/09/2022 12:27", "a_country": "UK", "b_country": "", "call_setup_time": "", "latency": "90.872", "quality": "", "test_type": "data", "throughput": "3.4598"} – rcmv Sep 12 '22 at 12:11