
I have a list of words and I want to find the frequency of each word group, then print every word (repeats included) together with the total frequency of its group.

PLEASE NOTE: In the solution, I don't want to use any loop like 'for' loop but arrive at same results.

For example, I have words as follows:

'abc'
'abc'
'abc'
'abc'
'xyz'
'xyz'
'tuf'
'pol'
'pol'
'pol'
'pol'
'pol'
'pol'

and I need this output:

'abc', 4
'abc', 4
'abc', 4
'abc', 4
'xyz', 2
'xyz', 2
'tuf', 1
'pol', 6
'pol', 6
'pol', 6
'pol', 6
'pol', 6
'pol', 6

I am using Python 3. I tried the code below, but it doesn't work as expected:

curr_tk = None
tk = None
count = 0

for items in data:
    line = items.strip()
    file = line.split(",")
    tk = file[0]

    if curr_tk == tk:
        count += 1
    else:
        if curr_tk:
            print('%s , %s' % (curr_tk, count))
        count = 1
        curr_tk = tk

# print last word
if curr_tk == tk:
    print('%s , %s' % (curr_tk, count))

The above code gives me this output:

'abc', 4
'xyz', 2
'tuf', 1
'pol', 6
hrokr
nkah
  • You're saying "I don't want to use any loop like 'for' loop ..." but your code contains a for loop. Is there a reason why you don't? – hrokr Oct 02 '22 at 02:51
  • `I don't want to use any loop like 'for' loop` do you mean that no for loops in the body of `for items in data`? – ILS Oct 02 '22 at 02:52
  • Loops are necessary anyway (loops hidden in C code are still loops), unless you can figure out a way to iterate through these strings without loops. – Mechanic Pig Oct 02 '22 at 02:53
  • `from collections import Counter; Counter(list_of_strings)` => `Counter({'pol': 6, 'abc': 4, 'xyz': 2, 'tuf': 1})`. – ekhumoro Oct 02 '22 at 11:42

3 Answers

0

I think I understand what you want to do. You need to print each repeated string with its count, like 'abc', 4 printed 4 times, but you don't want to do this with a for loop. I don't understand why you restrict yourself.

One approach is to build the output in a buffer. I show two variants, selected by the boolean first_way, to demonstrate this.

curr_tk = None                         
tk = None  
count = 0 

first_way = True
base_buffer = '{tk} , {count}\n'
output_buffer = ''
for items in data:
    line = items.strip()
    file = line.split(',') 
    tk = file[0]

    if curr_tk == tk:
        count += 1
        if first_way:
            output_buffer += base_buffer
    else:
        if curr_tk:
            if not first_way:  # use operator '*' to copy str
                # I guess the underlying implementation is also a loop
                # not sure whether this violates the requirement
                output_buffer = base_buffer * count
            print(output_buffer.format(tk=curr_tk, count=count), end='')
        count = 1
        curr_tk = tk
        if first_way:
            output_buffer = base_buffer

#print the last word group
if curr_tk:
    if not first_way:
        output_buffer = base_buffer * count
    print (output_buffer.format(tk=curr_tk, count=count), end='')

Given data = ['abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'tuf'], you will get the output:

abc , 4
abc , 4
abc , 4
abc , 4
xyz , 2
xyz , 2
tuf , 1
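
If the goal is only to avoid writing an explicit for statement, the same run-length idea can be sketched with itertools.groupby and the '*' string repetition used above. This is a minimal sketch, not a definitive implementation: render_group and repeat_with_counts are hypothetical helper names, and it assumes that each item in data is already a bare word and that equal words are adjacent, as in the example. The iteration still happens, just inside map and groupby.

```python
from itertools import groupby


def render_group(group):
    """Format one run of equal words as repeated 'word , count' lines."""
    word, run = group
    count = len(list(run))  # length of this consecutive run
    return f"{word} , {count}\n" * count  # '*' repeats the line count times


def repeat_with_counts(data):
    """Return the full output text without an explicit for statement."""
    # groupby only merges adjacent equal items, like the curr_tk logic above
    return "".join(map(render_group, groupby(data)))


data = ['abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'tuf']
print(repeat_with_counts(data), end='')
```

Because groupby works on consecutive runs, a word that reappears later in the list would start a new group with its own count, which matches the behaviour of the loop-based code.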
ILS
  • ILS...this seems to be helpful enough but I have another question. Assume I have a dataset that looks like this: abc,(ven, rat) abc,(elf, kls) abc,(iop, pos) abc,(89d, 82h) xyz,(k9e, sx4) xyz,(dge, ijd) tuf,(asc, 07f) pol,(ew6, 891) pol,(9i9, sai) pol,(0h, vd) Note: Where the data in the tuple( ) is let's say a metadata info. Is there a way i can use your formula to get a solution like 'abc',(ven, rat),4 'abc',(elf, kls),4 'abc',(iop, pos),4 'abc',(89d, 82h),4 'xyz',(k9e, sx4),2 'xyz',(dge, ijd),2 'tuf',(asc, 07f),1 'pol',(ew6, 891),3 'pol',(9i9, sai),3 'pol',(0h, vd),3 – nkah Oct 02 '22 at 22:01
  • @nkah, you may need to update your question and make it clearer. – ILS Oct 03 '22 at 03:36
0

Using a loop is unavoidable. But if you prefer not to see it, you can use pandas and let the package do the iteration behind the scenes:

words = ['abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'tuf', 'pol', 'pol', 'pol', 'pol', 'pol', 'pol']

import pandas as pd
df = pd.DataFrame(words, columns=['words'])
df1 = pd.DataFrame(df.value_counts(), columns=['counts'])
df.join(df1, on='words', how='inner')

output:

   words  counts
0    abc       4
1    abc       4
2    abc       4
3    abc       4
4    xyz       2
5    xyz       2
6    tuf       1
7    pol       6
8    pol       6
9    pol       6
10   pol       6
11   pol       6
12   pol       6
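
For comparison, the same count-then-join-back idea can be sketched with only the standard library, using the collections.Counter mentioned in the comments on the question. This is an illustrative sketch rather than the method above: with_counts is a hypothetical name, and it counts every occurrence of a word in the whole list, adjacent or not. The loop is hidden inside Counter and map.

```python
from collections import Counter


def with_counts(words):
    """Pair every word with the total number of times it occurs in words."""
    counts = Counter(words)  # one pass to build {word: total}
    # map looks up each word's total, preserving the original order
    return list(zip(words, map(counts.__getitem__, words)))


words = ['abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'tuf',
         'pol', 'pol', 'pol', 'pol', 'pol', 'pol']
# One 'word, count' line per input word, again without a for statement
print("\n".join(map(lambda pair: f"{pair[0]}, {pair[1]}", with_counts(words))))
```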
0

I don't know if this will help, but if you really don't want to use a loop, then don't use Python at all; use SQL. Here's the code:

DECLARE @phrases TABLE (id int, phrase varchar(max));
INSERT @phrases VALUES
    (1, 'Red and White'),
    (2, 'green'),
    (3, 'White and blue'),
    (4, 'Blue'),
    (5, 'Dark blue');

SELECT word, COUNT(*) c
FROM @phrases
CROSS APPLY (SELECT CAST('<a>' + REPLACE(phrase, ' ', '</a><a>') + '</a>' AS xml) xml1) t1
CROSS APPLY (SELECT n.value('.', 'varchar(max)') AS word FROM xml1.nodes('a') x(n)) t2
GROUP BY word

Babulo