
I have an input text file like this:

The quick brown fox jumps over the lazy dog
The quick brown fox
A beautiful dog

And I have keywords saved in another text file, like:

fox dog ...

I want to check each line of the input file for these keywords. I know how to check for the keywords one by one:

with open("input.txt") as f:
    a_file = f.read().splitlines()

b_file = []
for line in a_file:
    if "dog" in line:
        b_file.append("dog")
    elif "fox" in line:
        b_file.append("fox")
    else:
        b_file.append("Not found")

with open('output.txt', 'w') as f:
    f.write('\n'.join(b_file) + '\n')

But how do I check for them when they are stored in another file? P.S. I only need to check some specific lines, not the whole file. For the example above, the result should look like:

fox dog
fox
dog
  • Since you load the entire file into memory anyway, why not check if the word is in the contents of the entire file, instead of checking one line at a time? Do you actually want a line saying 'dog' for each line in the input file that has the keyword? What if multiple keywords appear on a line? What are you having trouble with exactly? You already know how to read the contents of a file, and you could just `.split()` the contents of the keyword file and loop over them to check if all of them are in the input file? – Grismar Feb 16 '22 at 02:49
  • Thanks for your reply. First, it is not necessary to check the whole file, only some specific lines; if multiple keywords appear on one line, output all of them – 4daJKong Feb 16 '22 at 02:58
  • @Grismar At the least, I need to know the index of each line that has these words... and which words they are – 4daJKong Feb 16 '22 at 05:09

3 Answers


You should load both files: one holds the keywords to query, the other holds the content to search. Say the files are named keywords.txt and content.txt; open them both:

with open("keywords.txt") as f1, open("content.txt") as f2:
    keywords = f1.read()
    content = f2.read()
# keywords: fox dog
# content: The quick brown fox jumps over the lazy dog\nThe quick brown fox\nA beautiful dog

If you only want to check if the content contains the keyword, then just do this:

keywords = [line.split() for line in keywords.split("\n")]
keywords = sum(keywords, [])  # flatten the list of lists into one list
# keywords: ['fox', 'dog']

content = [line.split() for line in content.split("\n")]
content = sum(content, [])  # flatten the list of lists into one list
# content: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'The', 'quick', 'brown', 'fox', 'A', 'beautiful', 'dog']

# check the intersection of the two sets: if any words overlap,
# the keywords appear in the content
if set(keywords) & set(content):
    print(True)
else:
    print(False)
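If, as the comments below suggest, you need matches per line rather than for the whole file, the same set-intersection idea works line by line. A sketch, reusing the example keywords and content from above (here inlined so it runs standalone):

```python
keywords = ["fox", "dog"]
content = ("The quick brown fox jumps over the lazy dog\n"
           "The quick brown fox\n"
           "A beautiful dog")

# collect (line number, keywords found on that line) pairs
per_line = []
for i, line in enumerate(content.split("\n"), start=1):
    found = set(keywords) & set(line.split())
    if found:
        per_line.append((i, sorted(found)))

print(per_line)  # [(1, ['dog', 'fox']), (2, ['fox']), (3, ['dog'])]
```

Note that a set loses the order and duplicates of matches; if those matter, see the list-comprehension approach in the next answer.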
  • Have a look at the revised one, thanks again; I have to check some specific lines one by one – 4daJKong Feb 16 '22 at 03:06
  • But which line, though? For example, if you have 100 lines, how do you know which line you want to check? – Binh Feb 16 '22 at 03:07
  • In front of each line there is a special word, like [a] [b] [c]; we only analyze the sentences with [a] – 4daJKong Feb 16 '22 at 03:18
  • Your question is not clear enough. You have to describe the format of your content file and keywords file. Then, what exactly do you want to get in return: the line number of each line which contains keywords, or the number of keywords found in the file? – Binh Feb 16 '22 at 04:50
  • At the least, I need to know the index of each line that has these words... and which words they are – 4daJKong Feb 16 '22 at 05:09

Although you changed a few of the requirements, it appears you want this:

  • to read a list of keywords from a file with these keywords on a single line, separated by space
  • to find lines of a text document that have any of these keywords on them, and output the line number (index) of the line they appear on and exactly which keywords were on it, for all lines that have them

This script does that:

with open('keywords.txt') as f:
    keywords = f.read().split()

with open('document.txt') as f, open('output.txt', 'w') as o:
    for n, line in enumerate(f):
        if matches := [k for k in keywords if k in line]:
            o.write(f'{n+1}: {matches}\n')

With keywords.txt something like:

fox dog

And document.txt something like:

the quick brown fox
jumped over the lazy dog
on a beautiful dog day afternoon, you foxy dog
there is nothing on FOX
and sometimes you're in a foxhole with a dog

It will write output.txt with:

1: ['fox']
2: ['dog']
3: ['fox', 'dog']
5: ['fox', 'dog']

If you don't want partial matches (like foxhole), care about the order in which words were found, perhaps want to know about duplicates as well, and want to make sure capitalisation doesn't matter:

with open('keywords.txt') as f:
    keywords = [k.lower() for k in f.read().split()]

with open('document.txt') as f, open('output.txt', 'w') as o:
    for n, line in enumerate(f):
        if matches := [w for w in line.split() if w.lower() in keywords]:
            o.write(f'{n+1}: {matches}\n')

And finally, perhaps your document.txt gets a 6th line with punctuation:

I watch "FOX", but although I search doggedly, I can't find a thing, you foxy dog!

Then this script:

import re
import string

with open('keywords.txt') as f:
    keywords = [k.lower() for k in f.read().split()]

with open('document.txt') as f, open('output.txt', 'w') as o:
    for n, line in enumerate(f):
        if matches := [w for w in re.sub('['+string.punctuation+']', '', line).split() if w.lower() in keywords]:
            o.write(f'{n+1}: {matches}\n')

Gets this written to output.txt:

1: ['fox']
2: ['dog']
3: ['dog', 'dog']
4: ['FOX']
5: ['dog']
6: ['FOX', 'dog']
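As an aside (not part of the answer above), `re.findall` with word boundaries can handle the punctuation and the matching in one step, case-insensitively; a sketch against the punctuated sixth line:

```python
import re

keywords = ["fox", "dog"]
# \b stops 'foxhole' and 'doggedly' from matching; IGNORECASE catches 'FOX'
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, keywords)) + r')\b',
                     re.IGNORECASE)

line = 'I watch "FOX", but although I search doggedly, I can\'t find a thing, you foxy dog!'
print(pattern.findall(line))  # ['FOX', 'dog']
```

The punctuation-stripping version above behaves the same here; the regex variant just avoids building the intermediate cleaned string.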

For those unfamiliar with Python, I want to extend Grismar's multi-part answer with two goals:

  1. explain the language constructs used
  2. extract all the matching-variants into functions and an enum

1. language constructs

The scripts above use: `with` statements (context managers) to open and automatically close files, list comprehensions to collect matches, `enumerate` to number the lines, f-strings for formatted output, and the walrus operator `:=` to assign and test in one expression.

2. extract matching variants

The Enum (class) defines the 3 proposed matching modes. We can then use this mode for both:

  • (a) reading the keywords ready-to-match, using the extracted function keywords_from
  • (b) finding matches of those keywords, using the extracted function match_keywords

from enum import Enum

class KeywordMatch(Enum):
    EXACT = 'exact'
    LOWER = 'lower'
    PARTIAL = 'partial'

# Usage: keywords = keywords_from('keywords.txt', KeywordMatch.LOWER)
def keywords_from(filename, mode):
    with open(filename) as f:
        if mode == KeywordMatch.LOWER:
            keywords = [k.lower() for k in f.read().split()]
        else:
            keywords = f.read().split()
    return keywords


import re
import string

# Usage: if match_keywords(line, KeywordMatch.LOWER):
def match_keywords(line, mode):
    if mode == KeywordMatch.LOWER
        matches = [w for w in line.split() if w.lower() in keywords]
    elif mode == KeywordMatch.PARTIAL:
        matches = [w for w in re.sub('['+string.punctuation+']', '', line).split() if w.lower() in keywords]
    else:
        matches = [k for k in keywords if k in line]
    return matches


if __name__ == "__main__":
    mode = KeywordMatch.LOWER

    keywords = keywords_from('keywords.txt', mode)

    with open('document.txt') as f, open('output.txt', 'w') as o:
        for n, line in enumerate(f):
            matches = match_keywords(line, mode)
            # can also test or debug-print matches
            if matches:
                o.write(f'{n+1}: {matches}\n')

Note:

  • despite all the modularization, the keywords list is still a global variable (which is not so clean)
  • removed the Walrus operator and kept matches separate to test or debug them before writing to file
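To address the first note, a small sketch of the same function with the keyword list passed in as a parameter instead of read from a global (the extra `keywords` parameter is my addition, not part of the answer above):

```python
import re
import string
from enum import Enum

class KeywordMatch(Enum):
    EXACT = 'exact'
    LOWER = 'lower'
    PARTIAL = 'partial'

# keywords passed explicitly instead of relying on a global variable
def match_keywords(line, mode, keywords):
    if mode == KeywordMatch.LOWER:
        return [w for w in line.split() if w.lower() in keywords]
    if mode == KeywordMatch.PARTIAL:
        cleaned = re.sub('[' + string.punctuation + ']', '', line)
        return [w for w in cleaned.split() if w.lower() in keywords]
    return [k for k in keywords if k in line]

print(match_keywords('the quick brown FOX', KeywordMatch.LOWER, ['fox', 'dog']))
# ['FOX']
```

This makes the function self-contained and easier to test in isolation.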

