How to extract text into dataframe or csv every line using python

Question

I have a text file containing

[(XXX)].XX>[(XXX)].X.XXX
XXX.[(X)].[(XXX)]>>[(XXX)].XX

There are about 10k lines. Each side of a line can contain 1 to 10 items such as [(XXX)] or XX.

I want two data frames (or CSV files) containing:

Data frame 1

     1       2       3         
1 [(XXX)]   XX 
2 XXX      [(X)]  [(XXX)]

Data frame 2

     1        2    3   
1  [(XXX)]    X   XXX
2  [(XXX)]   XX

I tried the following, but it failed:

def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 ,d3= '>', '>>','.' # just example delimiters
        results = []
        for line in file_contents:
            if d1 in line:
                results = []
            elif d2 in line:
                yield results
            else:
                results.append(line)

Appreciate any suggestion.

Actual dataset

[Na+].[CH3:2][C:3](=[O:5])[O-].[CH3:6][c:7]1[cH:12][cH:11][cH:10][cH:9][cH:8]1>>[c:7]1([CH3:6])[c:12]([C:3]([c:2]2[cH:11][cH:12][cH:7][cH:8][c:9]2[CH3:10])=[O:5])[cH:11][cH:10][cH:9][cH:8]1
[CH:1]1([C:4]([c:6]2[cH:11][cH:10][c:9]([C:12]([CH3:20])(C)[C:13](N(C)OC)=O)[cH:8][cH:7]2)=[O:5])[CH2:3][CH2:2]1.[BrH:21].[C:22](=[O:25])([O-])[OH:23].[Na+]>O>[Br:21][CH2:3][CH2:2][CH2:1][C:4]([c:6]1[cH:11][cH:10][c:9]([C:12]([CH3:20])([CH3:13])[C:22]([OH:23])=[O:25])[cH:8][cH:7]1)=[O:5]

Changing your question substantially after you received answers is bad because it invalidates the answers you already received, and it's hard for a new visitor to understand how the answers relate to the now different question. You should probably roll back your latest edit and accept one of the answers, then maybe post a new question with your *actual* requirements if you still can't figure it out. (Feel free to post an answer of your own and accept that, provided of course that it actually solves the question as originally stated.) – tripleee, Jul 18 '20 at 08:25

score 0 · Answer 1 · answered Jul 17 '20 at 19:01

First, we open the file and read all of its lines. Then we iterate over the data variable and split each line on the . character. For each line we add a new dictionary entry: the first element of splitter is the key, and the remaining elements are the value. Finally, we build a DataFrame from the dictionary, with one row per key; rows with fewer elements are padded with missing values.

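import pandas as pd

with open('file_name.txt') as f:
    data = f.read().splitlines()  # splitlines() drops the trailing newlines

buffer = {}
for line in data:
    splitter = line.split('.')
    # first field is the key, the remaining fields are the values
    # (lines that share the same first field overwrite each other)
    buffer[splitter[0]] = splitter[1:]

# orient='index' builds one row per key and pads shorter rows,
# so the value lists do not need to have the same length
df = pd.DataFrame.from_dict(buffer, orient='index')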

Aleksander Ikleiw


Can you please tell me whether the text file contains only two lines or more? – prax Jul 17 '20 at 19:08
There are about 10k lines. Each side of a line can contain 1 to 10 items such as [(XXX)] or XX. – Protima Rani Paul Jul 17 '20 at 19:16
I tried yours as well, but I am still facing a problem. I have added the actual dataset (first two lines) at the bottom of the post. Can you please help? – Protima Rani Paul Jul 17 '20 at 22:18

sushanth · Accepted Answer · 2020-07-17T19:40:51.780

Try this:

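import re
from io import StringIO

import pandas as pd

text = StringIO("""[(XXX)].XX>[(XXX)].X.XXX
XXX.[(X)].[(XXX)]>>[(XXX)].XX""")

df1, df2 = [], []

for l in text.read().splitlines():  # splitlines() drops the trailing newline
    # split on a run of '>' characters: left side -> df1, right side -> df2
    x, y = re.split(r">+", l)
    df1.append(x.split("."))
    df2.append(y.split("."))

# rows of unequal length are padded with None
print(pd.DataFrame(df1))
print(pd.DataFrame(df2))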

         0      1        2
0  [(XXX)]     XX     None
1      XXX  [(X)]  [(XXX)]

         0   1     2
0  [(XXX)]   X   XXX
1  [(XXX)]  XX  None
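
Note that splitting on `>+` assumes a single run of `>` per line. A line such as `A.B>O>C` (as in the actual dataset added to the question) splits into three parts, so unpacking into two names fails. Below is a minimal sketch of one way around that, using only the standard library and writing CSV text directly. It rests on an assumption that is not confirmed by the question: everything before the first `>` belongs to the first table, everything after the last `>` belongs to the second, and anything in between is ignored. The helper names `split_sides` and `to_csv_text` and the third sample line are made up for illustration.

```python
import csv
import io


def split_sides(line):
    """Return (left_fields, right_fields) for one line.

    Assumption: the left side is everything before the first '>',
    the right side is everything after the last '>', and any text
    between two '>' characters (the 'O' in '>O>') is dropped.
    """
    line = line.strip()
    left = line.partition(">")[0]
    right = line.rpartition(">")[2]
    return left.split("."), right.split(".")


def to_csv_text(rows):
    """Render rows of unequal length as CSV, padding short rows with ''."""
    width = max(len(r) for r in rows)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for r in rows:
        writer.writerow(r + [""] * (width - len(r)))
    return out.getvalue()


sample = [
    "[(XXX)].XX>[(XXX)].X.XXX",
    "XXX.[(X)].[(XXX)]>>[(XXX)].XX",
    "A.B>O>C",  # hypothetical line with a middle part
]

pairs = [split_sides(s) for s in sample]
left_rows = [p[0] for p in pairs]
right_rows = [p[1] for p in pairs]

print(to_csv_text(left_rows))
print(to_csv_text(right_rows))
```

To keep using pandas instead, the two lists of rows can be passed to `pd.DataFrame` exactly as in the code above. If the middle part matters, it would need a third list of its own.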

edited Jul 17 '20 at 19:40

answered Jul 17 '20 at 19:18

sushanth


Sorry, my mistake one sec. – Protima Rani Paul Jul 17 '20 at 19:44
p.splitlines() ERROR >>> too many values to unpack (expected 2);;; readlines() ERROR 'str' object has no attribute 'readlines' – Protima Rani Paul Jul 17 '20 at 20:35
f = open("Test(1).txt", "r") text=f.read() Got it now using readline() – Protima Rani Paul Jul 17 '20 at 20:40
try this, ```with open("Test(1).txt", "r") as f: for l in f.readlines()``` – sushanth Jul 17 '20 at 20:45
