I have a text file that I would like to load into a pandas DataFrame. It is read in as a string, and print(text) shows:

product: 1
description: product 1 desc
rating: 7.8
review: product 1 review

product: 2
description: product 2 desc
rating: 4.5
review: product 2 review

product: 3
description: product 3 desc
rating: 8.5
review: product 3 review

I figured I could split the string with text.split('\n\n') to group the records into a list. I assume that turning each record into a dict and then loading the dicts into a pandas DataFrame would be a good route, but I am having trouble doing so. Is this the best route, and could someone please help me get this into a pandas DataFrame?
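
For reference, the split-then-dict idea described above can be sketched roughly as follows. This is only a minimal sketch: it assumes records are separated by one blank line and that every line has the form `key: value`. The names `records` and `record` are illustrative and not taken from any real code.

```python
import pandas as pd

# Sample string in the same shape as the data shown above.
text = """product: 1
description: product 1 desc
rating: 7.8
review: product 1 review

product: 2
description: product 2 desc
rating: 4.5
review: product 2 review

product: 3
description: product 3 desc
rating: 8.5
review: product 3 review
"""

records = []
# Assumption: one blank line separates consecutive records.
for block in text.strip().split("\n\n"):
    record = {}
    for line in block.splitlines():
        # partition splits on the first ': ' only, so later colons stay in the value.
        key, sep, value = line.partition(": ")
        if sep:
            record[key] = value.strip()
    if record:
        records.append(record)

df = pd.DataFrame(records)
print(df)
```

All values stay strings here; a column such as rating could be converted afterwards with pd.to_numeric if numeric values are needed.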

lut17

1 Answer

You can use read_csv, create a group id by comparing the first column with the string product and taking the cumulative sum, then pivot:

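import pandas as pd

# a separator longer than one character is treated as a regex, which needs the python engine
df = pd.read_csv('file.txt', header=None, sep=': ', engine='python')
# a new group starts at every 'product' row; pivot needs keyword arguments in pandas 2.0+
df = df.assign(g = df[0].eq('product').cumsum()).pivot(index='g', columns=0, values=1)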
print (df)
0      description product rating             review
g                                                   
1   product 1 desc       1    7.8   product 1 review
2   product 2 desc       2    4.5   product 2 review
3   product 3 desc       3    8.5   product 3 review

Or create a list of dictionaries:

#https://stackoverflow.com/a/18970794/2901002
data = []
current = {}
with open('file.txt') as f:
    for line in f:
        pair = line.split(':', 1)
        if len(pair) == 2:
            if pair[0] == 'product' and current:
                # start of a new block
                data.append(current)
                current = {}
            current[pair[0]] = pair[1].strip()
    if current:
        data.append(current)
        
df = pd.DataFrame(data)
print (df)
  product     description rating            review
0       1  product 1 desc    7.8  product 1 review
1       2  product 2 desc    4.5  product 2 review
2       3  product 3 desc    8.5  product 3 review

Or reshape every 4 values into a 2D NumPy array and pass it to the DataFrame constructor (this assumes every record has exactly the same 4 fields in the same order):

df = pd.read_csv('file.txt', header=None, sep=': ', engine='python')

df = pd.DataFrame(df[1].to_numpy().reshape(-1, 4), columns=df[0].iloc[:4].tolist())
print (df)
  product     description rating            review
0       1  product 1 desc    7.8  product 1 review
1       2  product 2 desc    4.5  product 2 review
2       3  product 3 desc    8.5  product 3 review
jezrael
  • The first and last methods worked great on my sample data set. I asked a dumbed-down version to understand the concept, and my huge dataset is running into an error after making certain adjustments. `ParserError: Expected 2 fields in line 568502, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.` I am having trouble figuring out exactly what the issue is here. Any ideas of things to look for with this error? – lut17 Apr 07 '21 at 02:55
  • @rfl1735 - Is it possible to check line `568502`? If the problem is quoting, maybe this helps: `df = pd.read_csv('file.txt', header=None, sep=': ', engine='python', quoting=3)` – jezrael Apr 07 '21 at 04:23
  • Using `quoting=3` gets the error `ParserError: Expected 2 fields in line 568502, saw 3.` Same error, just without the `Error could possibly be due to quotes...` Line 568502 is 'review/text: While, we wouldn't have blind tasted picked out the idea of Mozzarella for these, the flavor is still cheesy and good. Everyone in my family enjoyed these. A light flavor that left us all fighting over the to the last bite. Definitely going on the grocery list.' with the outside quotes being added by me now. – lut17 Apr 07 '21 at 13:29
  • @rfl1735 - I think the problem is a double `:`, so pandas incorrectly splits the line into 3 columns. A possible solution is to use [this](https://stackoverflow.com/a/57522016/2901002) to split on the first `:` only – jezrael Apr 07 '21 at 13:51
  • I'm confused because I only see one `: ` in that line. Where are you seeing the other? – lut17 Apr 07 '21 at 16:00
  • Hmmm, you are right, then I have no idea why it is not working. If it is not confidential, is it possible to share the file with some 5 rows before and after line 568502, as a csv via Dropbox, Google Docs, WeTransfer or similar, so I can test the problem? – jezrael Apr 07 '21 at 19:10
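
The comments leave the cause of the ParserError unresolved. One way to rule out the delimiter entirely is to skip read_csv and split each line on the first `: ` only, as suggested above. The following is a minimal sketch under stated assumptions: the helper name `read_key_value_file` and the sample text are hypothetical, each record is assumed to start with a `product` line, and no key is assumed to repeat within a record.

```python
import io

import pandas as pd


def read_key_value_file(handle):
    """Parse 'key: value' lines, splitting on the first ': ' only.

    Hypothetical helper for illustration. Lines without ': ' (such as
    blank separator lines) are skipped.
    """
    rows = []
    for line in handle:
        key, sep, value = line.rstrip("\n").partition(": ")
        if sep:
            rows.append((key, value))
    pairs = pd.DataFrame(rows, columns=["key", "value"])
    # Same grouping trick as in the answer: a new record starts at each 'product'.
    pairs["g"] = pairs["key"].eq("product").cumsum()
    return pairs.pivot(index="g", columns="key", values="value")


# Made-up sample in which one value itself contains ': '.
sample = io.StringIO(
    "product: 1\n"
    "review: fine\n"
    "\n"
    "product: 2\n"
    "review: good: would buy again\n"
)
df = read_key_value_file(sample)
print(df)
```

For a real file, the same function could be called as `read_key_value_file(open('file.txt'))`. Because the split happens only once per line, a value that contains additional `: ` sequences stays in a single column.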