Finding intersection from list of list which has string as element

Question

I have the following list of lists, in which each inner list has two items, both strings.

neighbor_list = [['Mo0',
  '[PeriodicSite: S (1.5952, -0.9210, 37.6032) [0.3333, -0.3333, 0.9458], PeriodicSite: S (0.0000, 1.8419, 37.6032) [0.3333, 0.6667, 0.9458], PeriodicSite: S (3.1903, 1.8419, 37.6032) [1.3333, 0.6667, 0.9458], PeriodicSite: S (1.5952, -0.9210, 34.4734) [0.3333, -0.3333, 0.8671], PeriodicSite: S (0.0000, 1.8419, 34.4734) [0.3333, 0.6667, 0.8671], PeriodicSite: S (3.1903, 1.8419, 34.4734) [1.3333, 0.6667, 0.8671]]'],
 ['Mo1',
  '[PeriodicSite: S (1.5952, -0.9210, 12.7242) [0.3333, -0.3333, 0.3200], PeriodicSite: S (0.0000, 1.8419, 12.7242) [0.3333, 0.6667, 0.3200], PeriodicSite: S (3.1903, 1.8419, 12.7242) [1.3333, 0.6667, 0.3200], PeriodicSite: S (1.5952, -0.9210, 9.5944) [0.3333, -0.3333, 0.2413], PeriodicSite: S (0.0000, 1.8419, 9.5944) [0.3333, 0.6667, 0.2413], PeriodicSite: S (3.1903, 1.8419, 9.5944) [1.3333, 0.6667, 0.2413]]'],
 ['Mo2',
  '[PeriodicSite: S (-1.5952, 0.9210, 30.1636) [-0.3333, 0.3333, 0.7587], PeriodicSite: S (1.5952, 0.9210, 30.1636) [0.6667, 0.3333, 0.7587], PeriodicSite: S (0.0000, 3.6839, 30.1636) [0.6667, 1.3333, 0.7587], PeriodicSite: S (-1.5952, 0.9210, 27.0339) [-0.3333, 0.3333, 0.6800], PeriodicSite: S (1.5952, 0.9210, 27.0339) [0.6667, 0.3333, 0.6800], PeriodicSite: S (0.0000, 3.6839, 27.0339) [0.6667, 1.3333, 0.6800]]'],
 ['Mo3',
  '[PeriodicSite: S (-1.5952, 0.9210, 5.2846) [-0.3333, 0.3333, 0.1329], PeriodicSite: S (1.5952, 0.9210, 5.2846) [0.6667, 0.3333, 0.1329], PeriodicSite: S (0.0000, 3.6839, 5.2846) [0.6667, 1.3333, 0.1329], PeriodicSite: S (-1.5952, 0.9210, 2.1548) [-0.3333, 0.3333, 0.0542], PeriodicSite: S (1.5952, 0.9210, 2.1548) [0.6667, 0.3333, 0.0542], PeriodicSite: S (0.0000, 3.6839, 2.1548) [0.6667, 1.3333, 0.0542]]']]

The first item in each inner list (say Mo0) is the center, and all the S entries in the second item are its surroundings. First I want to print each center atom combined with the number of its surroundings, e.g. Mo0S6, Mo1S6, Mo2S6 and so on. Then I want to find whether Mo0, Mo1, Mo2 and Mo3 have any S in common, by comparing coordinates. For example, the first S neighbors of Mo0 and Mo1 have the coordinates:

S (1.5952, -0.9210, 37.6032) 
S (1.5952, -0.9210, 12.7242)

and so on.

I can get the center and surroundings by doing

for i in range(len(neighbor_list)):
    center = neighbor_list[i][0]
    surroundings = neighbor_list[i][1]

How can I sum the number of surroundings for each center atom and find the intersection between surroundings?

The final goal is to get the matrix in the following format

      Mo0S6  Mo1S6  Mo2S6  Mo3S6
Mo0S6    0.0    0.0    0.0    0.0
Mo1S6    0.0    0.0    0.0    0.0
Mo2S6    0.0    0.0    0.0    0.0
Mo3S6    0.0    0.0    0.0    0.0

All elements in the dataframe are 0 because there are no common S in this list.

Could anyone please help me with this? Thanks in advance.

I think you should simplify your question and make it more generic. It is hard to comprehend. — YOLO, Dec 19 '18 at 20:41
Thanks for your suggestion. The structure of the list itself is quite complicated. — hemanta, Dec 19 '18 at 20:48
So they are common if the first two numbers in the tuple are equivalent? And how do you arrive at `6`? (e.g. Mo2S6) — rahlf23, Dec 19 '18 at 20:51
Hi rahlf, they will be common only if all three coordinates are equal. So S (1.5952, -0.9210, 37.6032) and S (1.5952, -0.9210, 12.7242) are not common. Thanks. — hemanta, Dec 19 '18 at 20:53
This [link](https://stackoverflow.com/questions/53841562/how-to-sort-a-string-in-python-in-order-aab-instead-of-aba#53841588) will help you with sorting the way Mo0S6, Mo1S6, Mo2S6. — sahasrara62, Dec 19 '18 at 20:55
['Mo2', '[PeriodicSite: S (-1.5952, 0.9210, 30.1636) [-0.3333, 0.3333, 0.7587], PeriodicSite: S (1.5952, 0.9210, 30.1636) [0.6667, 0.3333, 0.7587], PeriodicSite: S (0.0000, 3.6839, 30.1636) [0.6667, 1.3333, 0.7587], PeriodicSite: S (-1.5952, 0.9210, 27.0339) [-0.3333, 0.3333, 0.6800], PeriodicSite: S (1.5952, 0.9210, 27.0339) [0.6667, 0.3333, 0.6800], PeriodicSite: S (0.0000, 3.6839, 27.0339) [0.6667, 1.3333, 0.6800]]'] contains 6 S atoms, so it is summed up as Mo2S6. — hemanta, Dec 19 '18 at 20:55
Hi Prashanta, thanks for the link, but I am not looking for sorting. I need to count the number of S atoms surrounding each Mo. — hemanta, Dec 19 '18 at 20:58
Correct me if I'm wrong, but in your example, there exist **no** common points. — rahlf23, Dec 19 '18 at 21:07
Yes, you are right, there are no common points in the part of the list I posted, so in this case returning 0 is fine. My goal is to build a pandas dataframe with Mo0S6, Mo1S6, Mo2S6, Mo3S6 as the row and column headers, with 1 if two centers have a common S and 0 if they don't share any. Can you please extend it up to that point? — hemanta, Dec 19 '18 at 21:11
I've updated my answer to include a method for identifying duplicates. — rahlf23, Dec 19 '18 at 21:17
Made some final modifications so that you don't have to resort to flattening the list of surroundings (accomplished in a similar manner using `stack()`) — rahlf23, Dec 19 '18 at 21:25

John · Answer 1 · 2018-12-19T21:13:56.777

Just parsing the strings without the need to import anything:

for item in neighbor_list:
    center=item[0]
    surroundings=item[1].split("PeriodicSite: S ")

    # remove extra brackets
    surroundings=surroundings[1:]
    surroundings[-1]=surroundings[-1][0:-1]

    print("%sS%d" % (center, len(surroundings)))

    surroundings = [x.replace("("," ").replace(")"," ").replace("["," ").replace("]"," ").replace(","," ") for x in surroundings]
    surroundings = [x.split() for x in surroundings]

    for S in surroundings:
        print("S (%s,%s,%s)" % (S[0], S[1], S[2]))

Gives:

Mo0S6
S (1.5952,-0.9210,37.6032)
S (0.0000,1.8419,37.6032)
S (3.1903,1.8419,37.6032)
S (1.5952,-0.9210,34.4734)
S (0.0000,1.8419,34.4734)
S (3.1903,1.8419,34.4734)
Mo1S6
S (1.5952,-0.9210,12.7242)
S (0.0000,1.8419,12.7242)
S (3.1903,1.8419,12.7242)
S (1.5952,-0.9210,9.5944)
S (0.0000,1.8419,9.5944)
S (3.1903,1.8419,9.5944)
Mo2S6
S (-1.5952,0.9210,30.1636)
S (1.5952,0.9210,30.1636)
S (0.0000,3.6839,30.1636)
S (-1.5952,0.9210,27.0339)
S (1.5952,0.9210,27.0339)
S (0.0000,3.6839,27.0339)
Mo3S6
S (-1.5952,0.9210,5.2846)
S (1.5952,0.9210,5.2846)
S (0.0000,3.6839,5.2846)
S (-1.5952,0.9210,2.1548)
S (1.5952,0.9210,2.1548)
S (0.0000,3.6839,2.1548)
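
As a possible next step from the parsed strings, the coordinates can be converted to tuples of floats and compared with set intersection. This is only a sketch under some assumptions: the sample below is a hypothetical, shortened version of the question's list in which the first S site of Mo1 is deliberately copied from Mo0, and `parse_sites`, `neighbors`, `labels` and `common` are illustrative names that do not appear in the original post.

```python
from itertools import combinations

# Hypothetical, shortened sample in the question's format. The first S site
# of Mo1 is deliberately copied from Mo0 so one intersection is non-empty.
neighbor_list = [
    ['Mo0', '[PeriodicSite: S (1.5952, -0.9210, 37.6032) [0.3333, -0.3333, 0.9458], '
            'PeriodicSite: S (0.0000, 1.8419, 37.6032) [0.3333, 0.6667, 0.9458]]'],
    ['Mo1', '[PeriodicSite: S (1.5952, -0.9210, 37.6032) [0.3333, -0.3333, 0.9458], '
            'PeriodicSite: S (0.0000, 1.8419, 12.7242) [0.3333, 0.6667, 0.3200]]'],
    ['Mo2', '[PeriodicSite: S (-1.5952, 0.9210, 30.1636) [-0.3333, 0.3333, 0.7587]]'],
]


def parse_sites(text):
    """Return the Cartesian coordinates of every S site as a list of float tuples."""
    sites = []
    for chunk in text.split("PeriodicSite: S ")[1:]:
        # chunk looks like "(x, y, z) [a, b, c], ..."; keep what is inside "(...)"
        inside = chunk[chunk.index("(") + 1:chunk.index(")")]
        sites.append(tuple(float(v) for v in inside.split(",")))
    return sites


neighbors = {center: parse_sites(text) for center, text in neighbor_list}
labels = {center: "%sS%d" % (center, len(sites)) for center, sites in neighbors.items()}

# Two sites count as common only if all three coordinates are equal.
common = {}
for a, b in combinations(neighbors, 2):
    common[(labels[a], labels[b])] = set(neighbors[a]) & set(neighbors[b])

for pair, shared in common.items():
    print(pair, len(shared), sorted(shared))
```

Exact float equality is enough here because every coordinate is parsed from the same four-decimal text. If the coordinates came from a calculation instead, rounding them or comparing with a tolerance would probably be safer.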

Thanks John, this is very helpful. — hemanta, Dec 19 '18 at 21:23

rahlf23 · Accepted Answer · 2018-12-19T21:24:33.250

You can clean up your data using ast.literal_eval and regex:

import pandas as pd
import re, ast

surrounding = [[ast.literal_eval(i) for i in re.findall(r'\([ ,.\d-]+\)', i[1])] for i in neighbor_list]
centers = ['{0}S{1}'.format(i[0], len(s)) for i, s in zip(neighbor_list, surrounding)]

data = dict(zip(centers, surrounding))

Gives:

{'Mo0S6': [(1.5952, -0.921, 37.6032), (0.0, 1.8419, 37.6032), (3.1903, 1.8419, 37.6032), (1.5952, -0.921, 34.4734), (0.0, 1.8419, 34.4734), (3.1903, 1.8419, 34.4734)],
'Mo1S6': [(1.5952, -0.921, 12.7242), (0.0, 1.8419, 12.7242), (3.1903, 1.8419, 12.7242), (1.5952, -0.921, 9.5944), (0.0, 1.8419, 9.5944), (3.1903, 1.8419, 9.5944)],
'Mo2S6': [(-1.5952, 0.921, 30.1636), (1.5952, 0.921, 30.1636), (0.0, 3.6839, 30.1636), (-1.5952, 0.921, 27.0339), (1.5952, 0.921, 27.0339), (0.0, 3.6839, 27.0339)],
'Mo3S6': [(-1.5952, 0.921, 5.2846), (1.5952, 0.921, 5.2846), (0.0, 3.6839, 5.2846), (-1.5952, 0.921, 2.1548), (1.5952, 0.921, 2.1548), (0.0, 3.6839, 2.1548)]}
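
To make the cleaning step easier to follow, here is a minimal sketch of what the regex and `ast.literal_eval` do to a single site string, assuming the `PeriodicSite` text format from the question. The pattern only matches parenthesized groups, so the fractional coordinates in square brackets are skipped. The names `site`, `matches` and `coords` are illustrative.

```python
import ast
import re

# One site in the text format shown in the question.
site = 'PeriodicSite: S (1.5952, -0.9210, 37.6032) [0.3333, -0.3333, 0.9458]'

# Match "(...)" groups made only of digits, dots, commas, spaces and minus
# signs. The square-bracket group is not matched.
matches = re.findall(r'\([ ,.\d-]+\)', site)
print(matches)

# literal_eval safely turns the matched text into a tuple of floats.
coords = ast.literal_eval(matches[0])
print(coords)
```

Because the values become floats, trailing zeros are dropped, which is why `-0.9210` appears as `-0.921` in the dictionary.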

You can then generate a dataframe directly using df = pd.DataFrame(data):

                       Mo0S6                      Mo1S6  \
0  (1.5952, -0.921, 37.6032)  (1.5952, -0.921, 12.7242)   
1     (0.0, 1.8419, 37.6032)     (0.0, 1.8419, 12.7242)   
2  (3.1903, 1.8419, 37.6032)  (3.1903, 1.8419, 12.7242)   
3  (1.5952, -0.921, 34.4734)   (1.5952, -0.921, 9.5944)   
4     (0.0, 1.8419, 34.4734)      (0.0, 1.8419, 9.5944)   
5  (3.1903, 1.8419, 34.4734)   (3.1903, 1.8419, 9.5944)   

                       Mo2S6                     Mo3S6  
0  (-1.5952, 0.921, 30.1636)  (-1.5952, 0.921, 5.2846)  
1   (1.5952, 0.921, 30.1636)   (1.5952, 0.921, 5.2846)  
2     (0.0, 3.6839, 30.1636)     (0.0, 3.6839, 5.2846)  
3  (-1.5952, 0.921, 27.0339)  (-1.5952, 0.921, 2.1548)  
4   (1.5952, 0.921, 27.0339)   (1.5952, 0.921, 2.1548)  
5     (0.0, 3.6839, 27.0339)     (0.0, 3.6839, 2.1548)

To find duplicates, we can simply use stack() and duplicated(keep=False), where keep=False ensures that we return both duplicates and their associated centers:

df.stack()[df.stack().duplicated(keep=False)]

Yields:

Series([], dtype: object)

You can confirm this method works by intentionally creating a duplicate in your sample data.
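
A sketch of that check, extended to the pairwise matrix requested in the question, might look like the following. It rests on assumptions: the two-center sample is hypothetical, with Mo1's first site deliberately made identical to Mo0's, and the `matrix` construction with `itertools.combinations` is one possible approach rather than part of the original answer. The diagonal is left at 0.0 to match the layout shown in the question.

```python
import ast
import re
from itertools import combinations

import pandas as pd

# Hypothetical two-center sample: Mo1's first site is deliberately made
# identical to Mo0's first site so that a duplicate exists.
neighbor_list = [
    ['Mo0', '[PeriodicSite: S (1.5952, -0.9210, 37.6032) [0.3333, -0.3333, 0.9458], '
            'PeriodicSite: S (0.0000, 1.8419, 37.6032) [0.3333, 0.6667, 0.9458]]'],
    ['Mo1', '[PeriodicSite: S (1.5952, -0.9210, 37.6032) [0.3333, -0.3333, 0.9458], '
            'PeriodicSite: S (0.0000, 1.8419, 12.7242) [0.3333, 0.6667, 0.3200]]'],
]

surrounding = [[ast.literal_eval(m) for m in re.findall(r'\([ ,.\d-]+\)', item[1])]
               for item in neighbor_list]
centers = ['{0}S{1}'.format(item[0], len(s)) for item, s in zip(neighbor_list, surrounding)]
data = dict(zip(centers, surrounding))
df = pd.DataFrame(data)

# The duplicated site is reported once for each center that contains it.
stacked = df.stack()
duplicates = stacked[stacked.duplicated(keep=False)]
print(duplicates)

# Pairwise matrix holding the number of shared sites for each pair of centers.
matrix = pd.DataFrame(0.0, index=centers, columns=centers)
for a, b in combinations(centers, 2):
    shared = len(set(data[a]) & set(data[b]))
    matrix.loc[a, b] = matrix.loc[b, a] = float(shared)
print(matrix)
```

If only a 1/0 flag is wanted instead of a count, `float(shared > 0)` could be stored in place of `float(shared)`.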

I can get the duplicates. I want to store len(duplicates) as the corresponding element in the matrix. Can you please add that line to make it clearer? — hemanta, Dec 19 '18 at 22:01
Keep in mind that this is not a code-writing service. Can you update the question with what you have tried? Look at `DataFrame.transform()` — rahlf23, Dec 19 '18 at 22:46
