-1

I need your help please.

I have a text file that contains one list per line; each line represents a list of items. I need to extract all items that occur at least twice (frequency >= 2) and write them to another file. Here is an example:

['COLG-CAD-406', 'CSAL-CAD-030', 'COLG-CAD-533', 'COLG-CAD-188']

['COLG-CAD-188']

['CSAL-CAD-030']

['EPHAG-JAE-004']

['COLG-CAD-188', 'CEM-SEV-004']

['COL-CAD-188', 'COLG-CAD-406']

The output should be:

['COLG-CAD-406'], 2

['CSAL-CAD-030'], 2

['COLG-CAD-188'], 3

and so on until the end of the file.

Thank you very much for your help in advance.

jamylak
saied salah

4 Answers

2

What about:

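import ast

count = {}  # shared across all lines, so frequencies cover the whole file
for x in f:
    if not x.strip():
        continue  # skip blank lines
    for w in ast.literal_eval(x):
        count[w] = count.get(w, 0) + 1

# report only after every line has been counted
for word, freq in count.items():
    if freq >= 2:
        print("%s %d" % (word, freq))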

where f is your file
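The question asks for the result to be written to another file rather than printed. A minimal sketch of that, assuming a hypothetical helper name `write_frequent_items` and caller-supplied file paths (neither is part of the original answer), and assuming every non-blank line is a valid Python list literal:

```python
import ast


def write_frequent_items(in_path, out_path, min_count=2):
    """Count items across all lines of in_path and write the frequent ones.

    Hypothetical helper for illustration: each non-blank line of in_path is
    assumed to be a Python list literal such as ['A', 'B'].
    """
    count = {}
    with open(in_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines between the lists
            for item in ast.literal_eval(line):
                count[item] = count.get(item, 0) + 1

    # Keep only the items that reach the threshold.
    frequent = {}
    for item, freq in count.items():
        if freq >= min_count:
            frequent[item] = freq

    # One "['ITEM'], N" line per frequent item, matching the requested format.
    with open(out_path, "w") as out:
        for item, freq in frequent.items():
            out.write("['%s'], %d\n" % (item, freq))
    return frequent
```

Calling `write_frequent_items("input.txt", "output.txt")` (both names are placeholders) would write one line per item that occurs at least twice; the order of the lines is not guaranteed on older Python versions.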

Ord
0

If you are using python 2.7 and up, with this input (called list1.txt):

['COLG-CAD-406', 'CSAL-CAD-030', 'COLG-CAD-533', 'COLG-CAD-188']
['COLG-CAD-188']
['CSAL-CAD-030']
['EPHAG-JAE-004']
['COLG-CAD-188', 'CEM-SEV-004']
['COLG-CAD-188', 'COLG-CAD-406']

and this python program:

from collections import Counter
import ast

cnt = Counter()

with open("list1.txt") as lfile:
    for line in lfile:
        # eval() could lead to python code injection so use literal_eval
        # the result is a list that we can directly use to update cnt keys
        cnt.update(ast.literal_eval(line))

for k, v in cnt.items():
    if v >= 2:
        print("%s: %d" % (k, v))

you get what you want:

CSAL-CAD-030: 2
COLG-CAD-406: 2
COLG-CAD-188: 4
Andreas Florath
  • [eval() will allow malicious data to compromise your entire system, kill your cat, eat your dog and make love to your wife.](http://stackoverflow.com/questions/661084/security-of-pythons-eval-on-untrusted-strings) – joaquin Apr 27 '12 at 19:21
  • Thank you very much. But when I applied the above code I got the following error: `from collections import Counter` `ImportError: cannot import name Counter`. I do not know why. Could you please help me fix this issue? Many thanks – saied salah Apr 28 '12 at 07:40
  • @saiedsalah: As I wrote in the answer: you need at least [python version 2.7](http://docs.python.org/library/collections.html). If you run this with an older version, you get this error. – Andreas Florath Apr 28 '12 at 07:54
  • Thank you very much. But I still get an error when I apply the code: `for k, v in cnt.iteritems():` `AttributeError: 'Counter' object has no attribute 'iteritems'` – saied salah Apr 28 '12 at 08:25
  • @saiedsalah: I guess you were using Python 3.x. `iteritems()` is not available in 3. I changed the code so that it now works with Python 2.7 and 3 (tested with 3.2). – Andreas Florath Apr 28 '12 at 08:50
0

This is a complete script that does exactly what you want, using regex:

from collections import defaultdict
import re

myarch = 'C:/code/test5.txt'   # this is your input file
mydict = defaultdict(int)

with open(myarch) as f:
    for line in f:
        # raw string so that \S is passed to the regex engine unchanged
        codes = re.findall(r"'(\S*)'", line)
        for key in codes:
            mydict[key] += 1

out = []
# items() works on both Python 2 and 3 (iteritems() is Python 2 only)
for key, value in mydict.items():
    if value > 1:
        text = "['%s'], %s" % (key, value)
        out.append(text)

# save to a file
with open('C:/code/fileout.txt', 'w') as fo:
    fo.write('\n'.join(out))

This can be simplified as:

from collections import defaultdict
import re

myarch = 'C:/code/test5.txt'
mydict = defaultdict(int)

with open(myarch) as f:
    for line in f:
        for key in re.findall(r"'(\S*)'", line):
            mydict[key] += 1

# items() works on both Python 2 and 3 (iteritems() is Python 2 only)
out = ["['%s'], %s" % (key, value) for key, value in mydict.items() if value > 1]

# save to a file
with open('C:/code/fileout.txt', 'w') as fo:
    fo.write('\n'.join(out))
joaquin
  • Thank you very much, this is really what I want. But I need a modification, because the original file contains another column, like this: `1298962762.0 ['EPHAG-JAE-004']` `1298962802.0 ['CEM-SEV-003', 'CEM-SEV-004']`. I need to print the same thing but keep the first column, which contains a number. This is the final output: `1298962762.0 CSAL-CAD-030 2`. Thank you – saied salah Apr 28 '12 at 07:20
  • @saiedsalah you shouldn't change your specifications in a comment. My answer addresses your current post. Either you modify your post with the new requirements (very odd after so many people have answered based on your initial conditions) or you vote on these answers and open a new question. – joaquin Apr 28 '12 at 17:46
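The follow-up requirement in the comment above (a leading number column before each list) is not fully specified, so the following is only a sketch under stated assumptions: the first whitespace-separated field is the number, the rest of the line is a Python list literal, and the number kept for each item is the one from the first line where that item appears. The function name `count_with_first_timestamp` is hypothetical.

```python
import ast


def count_with_first_timestamp(lines, min_count=2):
    """Return (number, item, count) tuples for items seen at least min_count times.

    Assumption (not confirmed by the asker): the number reported for an item
    is taken from the first line on which the item occurs.
    """
    counts = {}
    first_seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split off the leading column; the remainder is the list literal.
        stamp, raw_list = line.split(None, 1)
        for item in ast.literal_eval(raw_list):
            counts[item] = counts.get(item, 0) + 1
            first_seen.setdefault(item, stamp)
    return [(first_seen[item], item, n)
            for item, n in counts.items() if n >= min_count]
```

Each returned tuple could then be written out with something like `"%s %s %d" % (stamp, item, n)`. If a different number should be kept (for example the last one seen), only the `setdefault` line would need to change.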
0

Input:

['COLG-CAD-406', 'CSAL-CAD-030', 'COLG-CAD-533', 'COLG-CAD-188']

['COLG-CAD-188']

['CSAL-CAD-030']

['EPHAG-JAE-004']

['COLG-CAD-188', 'CEM-SEV-004']

['COL-CAD-188', 'COLG-CAD-406']

Output

>>> from collections import Counter
>>> from ast import literal_eval
>>> with open('input.txt') as f:
        c = Counter(word for line in f if line.strip() for word in literal_eval(line))


>>> print '\n'.join('{0}, {1}'.format([word],freq) for word,freq in c.iteritems() if freq >= 2)
['CSAL-CAD-030'], 2
['COLG-CAD-406'], 2
['COLG-CAD-188'], 3
jamylak