1

Apologies for what may be a very easy Python problem. I am working with a txt file in the following format. It's all on one line.

('text1','attribute1')('text2','attribute2')('text3','attribute3') .... ('text999','attribute999')

The file was originally written as a list of tuples but I would like to just extract it into a pandas dataframe with two columns. Is there an easy way to do that?

Edit: I suppose I need the first steps. Here is where I'm at:

myfile = open(file, 'r')
lines=myfile.readlines()

Output of lines looks like this, type list with length 1.

'(\'text1\', \'attribute1\')(\'text2\', \'attribute2\')

The backslashes aren't in the source txt file.

Ralph
  • I suppose I need the first steps. Here is where I'm at: myfile = open(file, 'r') lines=myfile.readlines() Output of lines looks like this, type list with length 1. '(\'text1\', \'attribute1\')(\'text2\', \'attribute2\') – Ralph May 06 '19 at 14:04

3 Answers

2

First read the file into a string, then use str.extractall and str.split:

import pandas as pd

s="('text1','attribute1')('text2','attribute2')('text3','attribute3')"

pd.Series(s).str.extractall(r'\((.*?)\)')[0].str.strip("'").str.split("','",expand=True)

Out[136]: 
             0           1
  match                   
0 0      text1  attribute1
  1      text2  attribute2
  2      text3  attribute3
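
The split on "','" above assumes there is no space after the comma, while the output in the question's edit shows one. As a hedged sketch (the sample string and the name PAIR are illustrative, not taken from the asker's file), a regex with two capture groups can tolerate optional whitespace around the comma:

```python
import re

# Hypothetical sample mirroring the question's edit, with a space after each comma.
s = "('text1', 'attribute1')('text2', 'attribute2')('text3', 'attribute3')"

# Capture the two single-quoted fields of each tuple.
# \s* tolerates optional whitespace around the comma and inside the parentheses.
PAIR = re.compile(r"\(\s*'(.*?)'\s*,\s*'(.*?)'\s*\)")

# With two groups, findall returns one (text, attribute) tuple per match.
rows = PAIR.findall(s)
print(rows)
```

The resulting list of 2-tuples can be passed straight to pd.DataFrame(rows, columns=[...]) with whatever column names you prefer. This sketch assumes the fields themselves contain no single quotes.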
BENY
2

You could use str.replace and ast.literal_eval to convert the string to a proper list of tuples, and then use pandas.DataFrame.from_records to create your DataFrame.

from ast import literal_eval
import pandas as pd

s = "('text1','attribute1')('text2','attribute2')('text3','attribute3')"

df = pd.DataFrame.from_records(literal_eval(f"[{s.replace(')(', '),(')}]"))

print(df)
#        0           1
# 0  text1  attribute1
# 1  text2  attribute2
# 2  text3  attribute3

# for python 3 versions pre-3.6 replace f string with "[{}]".format(s.replace(')(', '),('))

Per your question edit, you can open and read the file as follows to get the input string for the approach above. This uses read, which returns the file content as a single string, rather than readlines, since your file appears to contain only one line that you want to convert to a list of tuples. The escapes (backslashes) in your example are most likely an artifact of how the string is displayed in your console; if they don't exist in the source file, they won't be part of the string returned by read.

with open('yourfile.txt') as f:
    s = f.read()
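
To illustrate the point about backslashes, here is a small sketch of one way they can show up in console output without being in the data. The sample string is hypothetical; it contains both quote characters, which is a case where Python's repr escapes the single quotes. Other consoles or IDEs may display strings differently.

```python
# Hypothetical file content containing both single and double quotes.
s = """('text1', 'say "hi"')"""

# The string itself holds no backslash characters.
has_backslash_in_data = "\\" in s

# repr() is what the console shows when you echo a variable or print a list;
# here it wraps the string in single quotes and escapes the inner ones.
shown = repr(s)
has_backslash_in_repr = "\\" in shown

print(s)      # prints the raw characters, no backslashes
print(shown)  # prints the escaped representation
```

Checking with print(s) or "\\" in s tells you what is really in the string, regardless of how the console echoes it.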
benvc
  • Using https://stackoverflow.com/questions/8369219/how-to-read-a-text-file-into-a-string-variable-and-strip-newlines?rq=1 to read in the file, then using this response to change to pandas df – Ralph May 06 '19 at 20:03
0

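You can strip the outer parentheses, split on ')(' and then split each piece on the comma (s is the string read from the file, and pandas is imported as pd):

# Remove the leading '(' and trailing ')', then split into one chunk per tuple
sn = s.rstrip(")").lstrip("(").split(")(")
# Split each chunk on the comma, then strip the remaining single quotes
pd.DataFrame(list(map(lambda x: x.split(','), sn))).replace("'", "", regex=True)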

       0           1
0  text1  attribute1
1  text2  attribute2
2  text3  attribute3
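
One caveat with splitting on the comma: if a text field itself contains a comma, that row gets more than two pieces. A hedged sketch of the difference, using a hypothetical sample string and swapping in ast.literal_eval (the technique from the other answer) for the parsing step:

```python
import re
from ast import literal_eval

# Hypothetical sample where the first text field contains a comma.
s = "('hello, world','attribute1')('text2','attribute2')"

# Splitting on ',' breaks the first row into three pieces instead of two.
naive = [chunk.split(",") for chunk in s.rstrip(")").lstrip("(").split(")(")]

# Insert a comma between adjacent tuples (allowing optional whitespace),
# wrap the result in brackets, and let literal_eval parse the quoted fields.
as_list = "[" + re.sub(r"\)\s*\(", "),(", s.strip()) + "]"
rows = literal_eval(as_list)

print(len(naive[0]))  # number of pieces in the first row after the naive split
print(rows)
```

This still assumes that the sequence ")(" never occurs inside a field; if it can, the tuple boundaries would need a more careful parser.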
heena bawa