
I have a list of files where each file has two columns: the first column contains words, and the second column contains numbers.

I want to extract all the unique words across the files and sum their numbers. That part I was able to do...

The second task is to count the number of files in which each word appears. This is the part I am having trouble with... I am using a dictionary for it.

Here is my code:

import os

currentdir = " "  # CHANGE INPUT PATH
resultdir = " "   # CHANGE OUTPUT ACCORDINGLY

if not os.path.exists(resultdir):
    os.makedirs(resultdir)

systemcallcount = {}
for root, dirs, files in os.walk(currentdir):
    for name in files:
        with open(os.path.join(root, name)) as outfile2:
            for line in outfile2:
                words = line.split()
                if words[0] not in systemcallcount:
                    systemcallcount[words[0]] = int(words[1])
                else:
                    systemcallcount[words[0]] += int(words[1])

for key, value in systemcallcount.items():
    print(key)
    print(value)

For example, I have two files:

file1  file2
a  2    a 3
b  3    b 1 
c  1     




so the output would be:

a 5 2
b 4 2
c 1 1

To explain the second column of the output: a is 2 because it occurs in both files, whereas c is 1 because it appears only in file1.


ubuntu_noob
  • Not sure a dictionary is the proper data structure for your task; I'd suggest a list of dictionaries or a list of tuples. Also, the code would be much better if you manage to separate the reading of files from the operations on the file contents. – Evgeny May 21 '18 at 18:57

4 Answers


One way is to use collections.defaultdict. For each file, build a set of its words, then increment your dictionary counter once per word in that set.

import os
from collections import defaultdict

d = defaultdict(int)

for root, dirs, files in os.walk(currentdir):
    for name in files:
        with open(os.path.join(root, name)) as outfile2:
            # one entry per distinct first-column word in this file
            words = {line.split()[0] for line in outfile2}
            for word in words:
                d[word] += 1
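For the question's full task (sum the numbers and count the files), a minimal runnable sketch along the same lines, assuming two-column files like the example (the temp-directory setup only recreates the sample data):

```python
import os
import tempfile
from collections import defaultdict

# Recreate the two example files from the question in a temp directory.
currentdir = tempfile.mkdtemp()
with open(os.path.join(currentdir, "file1"), "w") as f:
    f.write("a 2\nb 3\nc 1\n")
with open(os.path.join(currentdir, "file2"), "w") as f:
    f.write("a 3\nb 1\n")

sums = defaultdict(int)         # word -> total of the numeric column
file_counts = defaultdict(int)  # word -> number of files containing the word

for root, dirs, files in os.walk(currentdir):
    for name in files:
        seen = set()
        with open(os.path.join(root, name)) as f:
            for line in f:
                word, number = line.split()
                sums[word] += int(number)
                seen.add(word)  # a set, so repeats in one file count once
        for word in seen:
            file_counts[word] += 1

for word in sorted(sums):
    print(word, sums[word], file_counts[word])
# a 5 2
# b 4 2
# c 1 1
```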
jpp

I hope this helps.

This code takes a string and checks the files in a folder for lines that contain it:

# https://www.opentechguides.com/how-to/article/python/59/files-containing-text.html
import os

search_string = "python"
search_path = r"C:\Users\You\Desktop\Project\Files"  # raw string keeps backslashes literal
extension = "txt"  # files extension

# Loop through files in the path specified
for fname in os.listdir(search_path):
    if fname.endswith(extension):
        # Open the file for reading; it closes automatically at the end of the block
        with open(os.path.join(search_path, fname)) as fo:
            for line_no, line in enumerate(fo, start=1):
                # Search for the string in the line
                index = line.find(search_string)
                if index != -1:
                    # Print the occurrence
                    print(fname, "[", line_no, ",", index, "] ", line, sep="")
Amine Messaoudi

Another way is to use Pandas to work on both of your tasks.

  1. Read the files into a table
  2. Note the source file in a separate column.
  3. Apply functions to get unique words, sum the numbers, and count the source files for each word.

Here is the code:

import os
import sys
import pandas as pd

currentdir = " "  # CHANGE INPUT PATH

files = os.listdir(currentdir)

dfs = []
for f in files:
    df = pd.read_csv(os.path.join(currentdir, f), sep='\t', header=None)
    df['source_file'] = f
    dfs.append(df)

def concat(x):
    return pd.Series(dict(A=x[0].unique()[0],          # the word itself
                          B=x[1].sum(),                # sum of the numbers
                          C=len(x['source_file'])))    # number of source files

df = pd.concat(dfs, ignore_index=True).groupby(0).apply(concat)

# Print result to standard output
df.to_csv(sys.stdout, sep='\t', header=None, index=None)
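If the custom function feels heavy, the same summary can also be had with named aggregation (a sketch; the inline frame below is a hypothetical stand-in for the concatenated files, with the same columns `0`, `1`, and `source_file`):

```python
import pandas as pd

# Hypothetical stand-in for the concatenated example files.
df = pd.DataFrame({0: ["a", "b", "c", "a", "b"],
                   1: [2, 3, 1, 3, 1],
                   "source_file": ["file1"] * 3 + ["file2"] * 2})

# One row per word: sum of the numbers and count of distinct source files.
summary = df.groupby(0).agg(total=(1, "sum"),
                            file_count=("source_file", "nunique"))
print(summary)
```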

You may refer here: Pandas groupby: How to get a union of strings

Claire
  • I am trying to run your code but I am getting errors-FileNotFoundError: File b'00002d74a9faa53f5199c910b652ef09d3a7f6bd42b693755a233635c3ffb0f4.apk.sys_names.txt' does not exist – ubuntu_noob May 21 '18 at 19:42
  • I ran this code print(os.path.join(root, f)) to check and the file exists – ubuntu_noob May 21 '18 at 19:42
  • @ubuntu_noob Because the file list was previously hard-coded according to your example. The code is edited to dynamically obtain the file list when you supply the `currentdir` variable. (You may use `sys.argv[1]` to do that in a script). Please see if you can run this. – Claire May 21 '18 at 19:53
  • I didn't directly copy your code... I made the appropriate changes with os.walk – ubuntu_noob May 21 '18 at 21:06
  • Please feel free. It is basically to show how you can manipulate the data easily in various ways with a Pandas dataframe. – Claire May 22 '18 at 07:26
  • Sorry for the last comment.. It was a silly mistake on my part... could you please explain why there are three columns in the output df.. I couldn't understand that – ubuntu_noob May 22 '18 at 08:06
  • It's okay. The first two columns are read from the files. The third column is added by the line `df['source_file'] = f`, to note down which file each row was read from. – Claire May 22 '18 at 08:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/171520/discussion-between-ubuntu-noob-and-claire). – ubuntu_noob May 22 '18 at 08:13

It appears that you want to parse the file into a dictionary of lists, so that for the input you provided:

file1  file2
a  2    a 3
b  3    b 1 
c  1  

... you get the following data structure after parsing:

{'a': [2, 3], 'b': [3, 1], 'c': [1]}

From that, you can easily get everything you need.

Parsing this way should be rather simple using a defaultdict:

from collections import defaultdict

parsed_data = defaultdict(list)

for filename in list_of_filenames:
    with open(filename) as f:
        for line in f:
            name, number = line.split()
            parsed_data[name].append(int(number))

After that, printing the data you are interested in should be trivial:

for name, values in parsed_data.items():
    print('{} {} {}'.format(name, sum(values), len(values)))

The solution assumes that the same name will not appear twice in the same file. It is not specified what should happen in that case.
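If a repeated name within one file should still count that file only once, one hedged variant deduplicates per file with a set before updating the counts (the in-memory `file_contents` dict below is a hypothetical stand-in for reading real files):

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two input files.
file_contents = {
    "file1": ["a 2", "b 3", "c 1", "a 4"],  # note the duplicate 'a'
    "file2": ["a 3", "b 1"],
}

sums = defaultdict(int)
file_counts = defaultdict(int)

for filename, lines in file_contents.items():
    seen = set()
    for line in lines:
        name, number = line.split()
        sums[name] += int(number)   # duplicates still add to the sum
        seen.add(name)
    for name in seen:               # ...but each file counts at most once
        file_counts[name] += 1

for name in sorted(sums):
    print(name, sums[name], file_counts[name])
# a 9 2
# b 4 2
# c 1 1
```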

TL;DR: The solution for your problems is defaultdict.

zvone
  • It's showing `ValueError: invalid literal for int() with base 10: 'killed'` in `parsed_data[name].append(int(number))`. – ubuntu_noob May 22 '18 at 12:22
  • @ubuntu_noob There were some bugs in the code, e.g. `split(' ')` instead of `split()`. It should now work, if your input files are well formatted. – zvone May 22 '18 at 14:14
  • It's showing `ValueError: too many values to unpack (expected 2)`... but the file is like this "epoll_wait 5703" – ubuntu_noob May 22 '18 at 14:20