Renumbering a Sequence of Numbers With Gaps using Python

Question

I am trying to figure out how to renumber a certain file format and struggling to get it right.

First, some background: computational chemistry uses a file format with the extension .xyz to describe the structure of a molecule. The first column is the number that identifies a specific atom (carbon, hydrogen, etc.), and the subsequent columns list the numbers of the atoms it is connected to. Below is a small sample of such a file; the usual file is significantly larger.

  259   252             
  260   254                  
  261   255                  
  262   256                  
  264   248   265   268      
  265   264   266   269   270
  266   265   267   282      
  267   266                  
  268   264                  
  269   265       
  270   265   271   276   277
  271   270   272   273      
  272   271   274   278      
  273   271   275   279      
  274   272   275   280      
  275   273   274   281      
  276   270                  
  277   270                  
  278   272                  
  279   273                  
  280   274                  
  282   266   283   286      
  283   282   284   287   288
  284   283   285   289      
  285   284                  
  286   282                  
  287   283                  
  288   283                  
  289   284   290   293      
  290   289   291   294   295
  291   290   292   304

As you can see, the numbers 263 and 281 are missing. There could be many more missing numbers, so my script needs to account for that. Below is the code I have so far. The lists missing_nums and missing_nums2 are hard-coded here, but I would normally obtain them from an earlier part of the script. The last element of missing_nums2 is where I want the renumbering to finish, in this case 289.

    missing_nums = ['263', '281']
    missing_nums2 = ['281', '289']

    with open("atom_nums.xyz", "r") as f2:
        lines = f2.read()
    
    for i in range(0, len(missing_nums) - 1):
        if i == 0:
            with open("atom_nums_out.xyz", "w") as f2: 
                
                replacement = int(missing_nums[i])
                
                for number in range(int(missing_nums[i]) + 1, int(missing_nums2[i])):
                    lines = lines.replace(str(number), str(replacement))
                    replacement += 1
                
                f2.write(lines)
    
        else:
            with open("atom_nums_out.xyz", "r") as f2:  
                lines = f2.read()
                
            with open("atom_nums_out.xyz", "w") as f2:   
                
                replacement = int(missing_nums[i]) - (i + 1)
                print(replacement)
                
                for number in range(int(missing_nums[i]), int(missing_nums2[i])):
                    lines = lines.replace(str(number), str(replacement))
                    replacement += 1
                    
                f2.write(lines)

The problem is that as the file gets larger, numbers appear to be repeated in the output, for reasons I cannot figure out. I hope somebody can help me here.

EDIT: The desired output of the code using the above sample would be

  259   252                  
  260   254                  
  261   255                  
  262   256                  
  263   248   264   267      
  264   263   265   268   269
  265   264   266   280      
  266   265                  
  267   263                  
  268   264                  
  269   264   270   275   276
  270   269   271   272      
  271   270   273   277      
  272   270   274   278      
  273   271   274   279      
  274   272   273   279      
  275   269                  
  276   269                  
  277   271                  
  278   272                  
  279   273                  
  280   265   281   284      
  281   280   282   285   286
  282   281   283   287      
  283   282                  
  284   280                  
  285   281                  
  286   281                  
  287   282   288   291      
  288   287   289   292   293
  289   288   290   302

Which is, indeed, what I get as the output for this small sample, but as the missing numbers increase it seems to not work and I get duplicate numbers. I can provide the whole file if anyone wants.

Thanks!

Do you mean the numbers repeat in the input file or just the output file? — itprorh66, Dec 16 '20 at 16:40
If you are using an IDE, **now** is a good time to learn its debugging features or the built-in [Python debugger](https://docs.python.org/3/library/pdb.html). Printing *stuff* at strategic points in your program can help you trace what is or isn't happening. [What is a debugger and how can it help me diagnose problems?](https://stackoverflow.com/questions/25385173/what-is-a-debugger-and-how-can-it-help-me-diagnose-problems) — wwii, Dec 16 '20 at 16:41
Why are you using ```with open("atom_nums.xyz", "r") as f2``` and ```with open("atom_nums_out.xyz", "w") as f2```? Should one be with f1 and the other with f2? — itprorh66, Dec 16 '20 at 16:44
@wwii Thanks for the reply, my apologies I didn't make it exactly clear. I've added the expected output to the OP now. — Neil, Dec 17 '20 at 08:16
@itprorh66 Well, I am opening the first one in read mode so I can read all the data in the file and assign it to the variable lines, then the second one in write mode to write the output. Since I am using with context managers, it doesn't especially matter if they are both called f2. But I suppose for easier readability, f1 and f2 would be better. — Neil, Dec 17 '20 at 08:21
Your operations change the *structure* of the molecule - that doesn't make sense to me but maybe that is ok. Your code is implementing some rules - you should state those rules. You say your code works for the example data given - how can we fix it if it works? Please read [mre]. `str.replace` will replace **all** instances of the *old* argument - you have a loop that replaces `264` with `263` then `265` with `264` ... that happens throughout the whole string not just the first column - is that what you intended? — wwii, Dec 17 '20 at 16:01
I am trying to understand the use of missing_nums and missing_nums2. I think you are using these two lists as follows: if a number x in the file falls between missing_nums[i] + 1 and missing_nums2[i] - 1, replace the number with x - 1. Is this a correct interpretation? — itprorh66, Dec 17 '20 at 16:26
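
To make the `str.replace` point from the comments concrete, here is a minimal sketch. The sample line and the four-digit ID `1264` are invented for illustration (the sample in the question only contains three-digit IDs), and `shift_token` is a hypothetical helper, not part of the original script.

```python
import re

# Invented line: 1264 is a hypothetical four-digit atom ID, not taken from the question's file.
line = "  264   248   265  1264"

# Chained str.replace, as in the question's loop: 264 -> 263, then 265 -> 264.
# Every occurrence of the text is replaced, including the "264" inside "1264".
naive = line.replace("264", "263").replace("265", "264")

def shift_token(match):
    """Hypothetical helper: shift only the whole numbers 264 and 265 down by one."""
    n = int(match.group())
    return str(n - 1) if n in (264, 265) else match.group()

# re.sub with \d+ matches each run of digits as a whole, so 1264 is left alone.
safe = re.sub(r"\d+", shift_token, line)

print(repr(naive))  # '  263   248   264  1263'
print(repr(safe))   # '  263   248   264  1264'
```

If the larger file contains IDs with more digits than the sample, this kind of substring match is one plausible source of the duplicate numbers.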

score 0 · Answer 1 · answered Dec 17 '20 at 17:32

Assuming my interpretation of the lists missing_nums and missing_nums2 is correct, this is how I would perform the operation.

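    from os import replace

    def fixFile(fn, mn1, mn2):
        # Stream the input one line at a time into a temporary file.
        with open(fn, "r") as fin, open("tmp.txt", "w") as fout:
            for line in fin:
                for i in range(len(mn1)):
                    minN = int(mn1[i])
                    maxN = int(mn2[i])
                    # Per the interpretation above: every number strictly
                    # between the two bounds is shifted down by one.
                    for nxtn in range(minN + 1, maxN):
                        # str.replace returns a new string, so reassign it.
                        line = line.replace(str(nxtn), str(nxtn - 1))
                fout.write(line)
        # Swap the temp file in for the original once all data is written.
        replace("tmp.txt", fn)


    missing_nums = ['263', '281']
    missing_nums2 = ['281', '289']
    fn = "atom_nums_out.xyz"

    fixFile(fn, missing_nums, missing_nums2)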

Note that I read the file only once, one line at a time, and write the result out one line at a time. After all the data is processed, I rename the temp file to the original filename. This means that significantly longer files will not chew up memory.
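
As an alternative that does not rely on substring replacement, here is a hedged sketch that treats each number as a whole token and subtracts the count of missing IDs below it, which reproduces rows such as `280   265   281   284` from the desired output in the question. The function names `renumber_line` and `renumber_file` are hypothetical, the list of missing IDs is assumed to be sorted, and the sketch does not decide what a reference to a missing ID itself should become.

```python
import bisect
import re

def renumber_line(line, missing):
    """Shift every whole number down by the count of missing IDs below it.

    `missing` is assumed to be a sorted list of ints.
    """
    def shift(match):
        n = int(match.group())
        new = n - bisect.bisect_left(missing, n)
        # Right-justify to the original width so the columns stay aligned.
        return str(new).rjust(len(match.group()))
    return re.sub(r"\d+", shift, line)

def renumber_file(src, dst, missing):
    """Stream src to dst one line at a time, renumbering as it goes."""
    missing = sorted(missing)
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(renumber_line(line, missing))

missing = [263, 281]
print(renumber_line("  282   266   283   286", missing))
# prints "  280   265   281   284"
```

Because each number is mapped exactly once from its original value, a replaced number can never be picked up again by a later replacement, and the approach works for any number of gaps.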
