
I have a text file containing lines like these:

[(XXX)].XX>[(XXX)].X.XXX
XXX.[(X)].[(XXX)]>>[(XXX)].XX

There are about 10k lines. There can be 1 to 10 of these tokens ([(XXX)], XX, etc.).

I want two data frames/CSV files containing:

Data frame 1

     1         2        3
1  [(XXX)]    XX
2  XXX        [(X)]    [(XXX)]

Data frame 2

     1         2    3
1  [(XXX)]    X    XXX
2  [(XXX)]    XX

I tried the following, but it failed:

def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 ,d3= '>', '>>','.' # just example delimiters
        results = []
        for line in file_contents:
            if d1 in line:
                results = []
            elif d2 in line:
                yield results
            else:
                results.append(line)

I would appreciate any suggestions.

Actual dataset

[Na+].[CH3:2][C:3](=[O:5])[O-].[CH3:6][c:7]1[cH:12][cH:11][cH:10][cH:9][cH:8]1>>[c:7]1([CH3:6])[c:12]([C:3]([c:2]2[cH:11][cH:12][cH:7][cH:8][c:9]2[CH3:10])=[O:5])[cH:11][cH:10][cH:9][cH:8]1
[CH:1]1([C:4]([c:6]2[cH:11][cH:10][c:9]([C:12]([CH3:20])(C)[C:13](N(C)OC)=O)[cH:8][cH:7]2)=[O:5])[CH2:3][CH2:2]1.[BrH:21].[C:22](=[O:25])([O-])[OH:23].[Na+]>O>[Br:21][CH2:3][CH2:2][CH2:1][C:4]([c:6]1[cH:11][cH:10][c:9]([C:12]([CH3:20])([CH3:13])[C:22]([OH:23])=[O:25])[cH:8][cH:7]1)=[O:5]
    Changing your question substantially after you received answers is bad because it invalidates the answers you already received, and it's hard for a new visitor to understand how the answers relate to the now different question. You should probably roll back your latest edit and accept one of the answers, then maybe post a new question with your *actual* requirements if you still can't figure it out. (Feel free to post an answer of your own and accept that, provided of course that it actually solves the question as originally stated.) – tripleee Jul 18 '20 at 08:25

2 Answers


First, we open the file and use readlines() to read all of its lines. We then iterate over data and split each line on the '.' character. For each line we add a dictionary entry: the first element of splitter is the key and the remaining elements are the value. Finally, we build a pandas DataFrame from the dictionary.

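import pandas as pd

with open('file_name.txt') as f:
    data = f.readlines()

buffer = {}
for i in data:
    # Strip the trailing newline, then split the line on '.'
    splitter = i.strip().split('.')
    # The first token is the key; the remaining tokens are the values
    buffer[splitter[0]] = splitter[1:]

# orient='index' turns each key into a row and pads shorter rows,
# so lines with different token counts do not raise a ValueError
df = pd.DataFrame.from_dict(buffer, orient='index')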
Aleksander Ikleiw

Try this:

import re
from io import StringIO

import pandas as pd

text = StringIO("""[(XXX)].XX>[(XXX)].X.XXX
XXX.[(X)].[(XXX)]>>[(XXX)].XX""")

df1, df2 = [], []

for l in text.readlines():
    # Strip the trailing newline, then split on one or more '>' characters
    x, y = re.split(r">+", l.strip())
    df1.append(x.split("."))
    df2.append(y.split("."))

print(pd.DataFrame(df1))
print(pd.DataFrame(df2))

         0      1        2
0  [(XXX)]     XX     None
1      XXX  [(X)]  [(XXX)]

         0   1     2
0  [(XXX)]   X   XXX
1  [(XXX)]  XX  None
sushanth