How to search, compare and manipulate based on the first letters in a big file

Question

I know the title is pretty confusing. So, what I have now is a txt file. For example:

*Product 1    Orange
*Product 2    Banana 
*Product 3    Peach
*Product 4    Orange
*Product 5    Peach

So, my goal is to print the products that share the same fruit together, e.g. "Product 1 Product 4" and "Product 3 Product 5". I thought about using a for loop:

    for line in lis:
        if line[0] == "*":
            continue
        else:
            print(line)

But it prints individual characters instead. Can someone please help? How do I use the for loop to scan every line instead?

You can try lis.split('*'), because a for loop over a string iterates over its characters. – Tim, Dec 16 '19 at 00:38
What is the data format? Is that a tab between the Product number and the fruit? Is this test data, or the real thing? – AMC, Dec 16 '19 at 01:07
Also I think there are multiple simple questions in here. What is the actual issue? – AMC, Dec 16 '19 at 02:23
It's a test sample. Yes, that's a tab between the product number and the fruit. – Danny Xu, Dec 16 '19 at 03:19
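
To illustrate the comment above about iterating over characters, here is a minimal sketch. It assumes `lis` held the whole file contents as one string (an assumption, since the question does not show how `lis` was built). The sample text copies part of the question's data, with a tab between the number and the fruit.

```python
# Assumption: `lis` was the entire file read into one string, e.g. via f.read().
text = "*Product 1\tOrange\n*Product 2\tBanana\n*Product 3\tPeach\n"

# Iterating over a string yields one character at a time.
chars = [c for c in text]
print(chars[:3])  # ['*', 'P', 'r']

# Iterating over the split lines yields one whole line at a time.
rows = text.splitlines()
for row in rows:
    print(row)
```

If the file is read with `readlines()` instead, the loop already receives whole lines and no splitting is needed.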

score 1 · Answer 1 · answered Dec 16 '19 at 01:39

Here is an example. You can use pandas to handle big files; just install it using pip.

import pandas as pd

# Read the whitespace-delimited text file and name the columns 'Product', 'Num', 'Fruit'
# (sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas versions)
df = pd.read_csv('yourtxtfile.txt', sep=r'\s+', names=['Product', 'Num', 'Fruit'])
# Merge Product and Num into a single column, e.g. '*Product 1'
df['Product_num'] = df.agg('{0[Product]} {0[Num]}'.format, axis=1)
df.drop(['Product', 'Num'], axis=1, inplace=True)
# Pivot so that each Fruit row lists its products, e.g. '*Product 1,*Product 4'
print(pd.pivot_table(df, index=['Fruit'], values='Product_num', aggfunc=lambda x: ','.join(x)))

Result:

                  Product_num
Fruit
Banana             *Product 2
Orange  *Product 1,*Product 4
Peach   *Product 3,*Product 5

Gerd · Answer 2 · 2019-12-17T11:59:25.777

1

You can read the file line-by-line and then use a dictionary data structure with the fruit as the key and the products as the values:

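# 'products.txt' is a placeholder file name
with open('products.txt') as f:
    lines = f.readlines()

products = {}  # not named `dict`, which would shadow the built-in type
for line in lines:
    l = line[1:].split()  # remove '*' from line, then split on whitespace
    fruit = l[2]
    product = l[0] + ' ' + l[1]
    if fruit in products:
        products[fruit] += ' ' + product
    else:
        products[fruit] = product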

For your example, this yields:

{'Orange': 'Product 1 Product 4', 'Banana': 'Product 2', 'Peach': 'Product 3 Product 5'}
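
Building on the dictionary idea, here is a minimal sketch that stores the products in lists and prints only the fruits that occur more than once, which seems to be what the question asks for. The helper name `group_products` and the inline sample lines are illustrative assumptions, not part of the answer above.

```python
from collections import defaultdict


def group_products(lines):
    """Map each fruit to the list of products that carry it."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.lstrip('*').split()  # e.g. ['Product', '1', 'Orange']
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        groups[parts[2]].append(parts[0] + ' ' + parts[1])
    return dict(groups)


# Sample lines copied from the question (tab between number and fruit).
sample = [
    "*Product 1\tOrange\n",
    "*Product 2\tBanana\n",
    "*Product 3\tPeach\n",
    "*Product 4\tOrange\n",
    "*Product 5\tPeach\n",
]

groups = group_products(sample)
for fruit, products in groups.items():
    if len(products) > 1:  # only fruits shared by two or more products
        print(' '.join(products))
```

For a big file, the open file object itself can be passed in (for example `group_products(f)` inside a `with open(...)` block), since iterating over a file yields one line at a time without loading the whole file into memory.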

edited Dec 17 '19 at 11:59

answered Dec 17 '19 at 10:59

Gerd


That way, wouldn't the for loop read letters instead of lines? So when you use l[2], wouldn't it read the third letter of the line after you remove the '*'? – Danny Xu Dec 17 '19 at 13:01
No: read the file line-by-line (e.g. with `readlines()` as shown in the linked question). Then `lines` is a list of all lines in the file, `line` is a string containing the single current line, and `l` is the split line (i.e. a list of the single "words", after dropping the `*`), so `l[0]` is always the word "Product", `l[1]` is the number, and `l[2]` is the name of the fruit. – Gerd Dec 17 '19 at 13:09
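
To make the explanation above concrete, here is a short sketch of what each name holds for a single line of the sample data. The literal line is copied from the question, with a tab assumed between the number and the fruit.

```python
line = "*Product 1\tOrange\n"  # one element of `lines`

l = line[1:].split()  # drop the leading '*', then split on whitespace
print(l)        # ['Product', '1', 'Orange']
print(l[2])     # Orange  (a whole word, because `l` is a list)
print(line[2])  # r  (indexing the string itself gives a single character)
```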
