How to display all words that contain these characters?

Question

I have a text file and I want to display all words that contain both the z and x characters.

How can I do that?

Regular Expressions are king when it comes to text parsing. Look at Ishpeck's solution. — Squirrelsama, Oct 18 '10 at 20:54

score 12 · Accepted Answer · answered Oct 18 '10 at 20:06

If you don't want to have 2 problems:

# split() with no arguments splits on any run of whitespace
for word in open('myfile.txt').read().split():
    if 'x' in word and 'z' in word:
        print word

Wooble

Thank goodness you provided an answer that *doesn't* use regular expressions. – gotgenes Oct 18 '10 at 20:14
+1: I like this very much. The only problem I can see is that you'll get any punctuation surrounding your words too, not just the words themselves. – Tim Pietzcker Oct 18 '10 at 20:16
True, I'm using python's definition of "words", which might be unreasonable here. – Wooble Oct 18 '10 at 20:18
Punctuation is pretty trivial to remove efficiently http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python – gotgenes Oct 18 '10 at 20:56
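
The punctuation concern raised in the comments above can be handled by stripping punctuation from each whitespace-delimited token. A minimal Python 3 sketch, assuming that only leading and trailing punctuation should go; the function name `words_containing` and the sample string are illustrative and not part of the answer:

```python
import string


def words_containing(text, required="xz"):
    """Return whitespace-delimited words that contain every character in `required`.

    Leading and trailing punctuation is stripped; punctuation inside a
    word (such as an apostrophe) is left alone.
    """
    matches = []
    for token in text.split():
        word = token.strip(string.punctuation)
        if word and all(ch in word for ch in required):
            matches.append(word)
    return matches


sample = "The (xyzzy) spell, a zax; and plain words."
print(words_containing(sample))  # ['xyzzy', 'zax']
```

Returning a list instead of printing keeps the function usable from other code; the caller decides how to display the result.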

score 8 · Answer 2 · answered Oct 18 '10 at 19:59

Assuming you have the entire file as one large string in memory, and that the definition of a word is "a contiguous sequence of letters" (strictly, `\w` also matches digits and underscores), then you could do something like this:

import re
for word in re.findall(r"\w+", mystring):
    if 'x' in word and 'z' in word:
        print word
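
What counts as a "word" under `\w+` may matter for some inputs. The Python 3 sketch below, run on a made-up sample string, compares `\w+` with an ASCII letters-only character class; the letters-only class is an illustration, not something the answer proposes:

```python
import re

sample = "lizard's x_z2 zax-box"

# \w matches letters, digits and the underscore, so "x_z2" stays in one piece,
# while the apostrophe and the hyphen split their words apart.
word_tokens = re.findall(r"\w+", sample)

# An ASCII letters-only class also splits on digits and underscores.
letter_tokens = re.findall(r"[A-Za-z]+", sample)

print(word_tokens)    # ['lizard', 's', 'x_z2', 'zax', 'box']
print(letter_tokens)  # ['lizard', 's', 'x', 'z', 'zax', 'box']
```

With `\w+`, the token `x_z2` would be reported as a word containing both letters; with the letters-only class it would not.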

Tim Pietzcker

I like this answer. It's the cleanest solution. If performance becomes an issue, time it against my solution and pick the winner. – Steven Rumbalski Oct 18 '10 at 20:14

score 4 · Answer 3 · edited Oct 18 '10 at 21:30

>>> import re
>>> pattern = re.compile(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b')
>>> document = '''Here is some data that needs
... to be searched for words that contain both z
... and x.  Blah xz zx blah jal akle asdke asdxskz
... zlkxlk blah bleh foo bar'''
>>> print pattern.findall(document)
['xz', 'zx', 'asdxskz', 'zlkxlk']
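
One pitfall with patterns like this is worth spelling out: `\b` only means "word boundary" when the backslash actually reaches the regex engine. In an ordinary (non-raw) Python string literal, `"\b"` is the backspace character. A short Python 3 sketch of the difference, using a shortened version of the sample text:

```python
import re

text = "xz zx asdxskz zlkxlk blah"

# In a normal string literal, "\b" is the backspace character (0x08),
# so this pattern looks for literal backspaces around the word.
not_raw = re.compile("\b(\\w*z\\w*x\\w*|\\w*x\\w*z\\w*)\b")

# In a raw string literal, the backslash reaches the regex engine intact
# and \b means "word boundary".
raw = re.compile(r"\b(\w*z\w*x\w*|\w*x\w*z\w*)\b")

print(len("\b"), len(r"\b"))  # 1 2
print(not_raw.findall(text))  # []
print(raw.findall(text))      # ['xz', 'zx', 'asdxskz', 'zlkxlk']
```

A pattern that silently matches nothing can also look very fast in a benchmark, so it is worth checking the output before comparing timings.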

edited Oct 18 '10 at 21:30

answered Oct 18 '10 at 20:05

Steven Rumbalski

I can confirm this works and is better than my reply. I'll delete mine in favor of this one. – Ishpeck Oct 18 '10 at 21:03

score 3 · Answer 4 · edited May 23 '17 at 12:31

I just want to point out how heavy-handed some of these regular expressions can be, in comparison to the simple string methods-based solution provided by Wooble.

Let's do some timings, shall we?

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import timeit
import re
import sys

WORD_RE_COMPILED = re.compile(r'\w+')
Z_RE_COMPILED = re.compile(r'(\b\w*z\w*\b)')
XZ_RE_COMPILED = re.compile(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b')

##########################
# Tim Pietzcker's solution
# https://stackoverflow.com/questions/3962846/how-to-display-all-words-that-contain-these-characters/3962876#3962876
#
def xz_re_word_find(text):
    for word in re.findall(r'\w+', text):
        if 'x' in word and 'z' in word:
            print word


# Tim's solution, compiled
def xz_re_word_compiled_find(text):
    pattern = re.compile(r'\w+')
    for word in pattern.findall(text):
        if 'x' in word and 'z' in word:
            print word


# Tim's solution, with the RE pre-compiled so compilation doesn't get
# included in the search time
def xz_re_word_precompiled_find(text):
    for word in WORD_RE_COMPILED.findall(text):
        if 'x' in word and 'z' in word:
            print word


################################
# Steven Rumbalski's solution #1
# (provided in the comment)
# https://stackoverflow.com/questions/3962846/how-to-display-all-words-that-contain-these-characters/3963285#3963285
def xz_re_z_find(text):
    for word in re.findall(r'(\b\w*z\w*\b)', text):
        if 'x' in word:
            print word


# Steven's solution #1 compiled
def xz_re_z_compiled_find(text):
    pattern = re.compile(r'(\b\w*z\w*\b)')
    for word in pattern.findall(text):
        if 'x' in word:
            print word


# Steven's solution #1 with the RE pre-compiled
def xz_re_z_precompiled_find(text):
    for word in Z_RE_COMPILED.findall(text):
        if 'x' in word:
            print word


################################
# Steven Rumbalski's solution #2
# https://stackoverflow.com/questions/3962846/how-to-display-all-words-that-contain-these-characters/3962934#3962934
def xz_re_xz_find(text):
    for word in re.findall(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b', text):
        print word


# Steven's solution #2 compiled
def xz_re_xz_compiled_find(text):
    pattern = re.compile(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b')
    for word in pattern.findall(text):
        print word


# Steven's solution #2 pre-compiled
def xz_re_xz_precompiled_find(text):
    for word in XZ_RE_COMPILED.findall(text):
        print word


#################################
# Wooble's simple string solution
def xz_str_find(text):
    for word in text.split():
        if 'x' in word and 'z' in word:
            print word


functions = [
        'xz_re_word_find',
        'xz_re_word_compiled_find',
        'xz_re_word_precompiled_find',
        'xz_re_z_find',
        'xz_re_z_compiled_find',
        'xz_re_z_precompiled_find',
        'xz_re_xz_find',
        'xz_re_xz_compiled_find',
        'xz_re_xz_precompiled_find',
        'xz_str_find'
]

import_stuff = functions + [
        'text',
        'WORD_RE_COMPILED',
        'Z_RE_COMPILED',
        'XZ_RE_COMPILED'
]


if __name__ == '__main__':

    text = open(sys.argv[1]).read()
    timings = {}
    setup = 'from __main__ import ' + ','.join(import_stuff)
    for func in functions:
        statement = func + '(text)'
        timer = timeit.Timer(statement, setup)
        min_time = min(timer.repeat(3, 10))
        timings[func] = min_time


    for func in functions:
        print func + ":", timings[func], "seconds"

Running this script on a plaintext copy of Moby Dick obtained from Project Gutenberg, on Python 2.6, I get the following timings:

xz_re_word_find: 1.21829485893 seconds
xz_re_word_compiled_find: 1.42398715019 seconds
xz_re_word_precompiled_find: 1.40110301971 seconds
xz_re_z_find: 0.680151939392 seconds
xz_re_z_compiled_find: 0.673038005829 seconds
xz_re_z_precompiled_find: 0.673489093781 seconds
xz_re_xz_find: 1.11700701714 seconds
xz_re_xz_compiled_find: 1.12773990631 seconds
xz_re_xz_precompiled_find: 1.13285303116 seconds
xz_str_find: 0.590088844299 seconds

In Python 3.1 (after using 2to3 to fix the print statements), I get the following timings:

xz_re_word_find: 2.36110496521 seconds
xz_re_word_compiled_find: 2.34727501869 seconds
xz_re_word_precompiled_find: 2.32607793808 seconds
xz_re_z_find: 1.32204890251 seconds
xz_re_z_compiled_find: 1.34104800224 seconds
xz_re_z_precompiled_find: 1.34424304962 seconds
xz_re_xz_find: 2.33851099014 seconds
xz_re_xz_compiled_find: 2.29653286934 seconds
xz_re_xz_precompiled_find: 2.32416701317 seconds
xz_str_find: 0.656699895859 seconds

We can see that, in these runs, the regular expression-based functions take from about 1.1 to 2.4 times as long as the string methods-based function in Python 2.6, and from about 2 to 3.6 times as long in Python 3.1. The time difference is trivial for one-off parsing (nobody's going to miss those milliseconds), but for cases where the function must be called many times, the string methods-based approach is both simpler and faster.
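
For anyone repeating the comparison on a current interpreter, here is a compact Python 3 sketch of the same idea. It is not the script above: the function names are made up, the corpus is a small placeholder string rather than Moby Dick, and the functions return lists instead of printing so that console output is not part of what gets timed.

```python
import re
import timeit

WORD_RE = re.compile(r"\w+")
XZ_RE = re.compile(r"\b(\w*z\w*x\w*|\w*x\w*z\w*)\b")


def by_split(text):
    """String methods only: split on whitespace, then test membership."""
    return [w for w in text.split() if "x" in w and "z" in w]


def by_word_regex(text):
    """Tokenize with \\w+, then test membership."""
    return [w for w in WORD_RE.findall(text) if "x" in w and "z" in w]


def by_single_regex(text):
    """Let one regex do both the tokenizing and the filtering."""
    return XZ_RE.findall(text)


# Placeholder corpus; substitute the contents of a real text file.
text = "the quick zax jumps over a lazy xylophone and a zx word\n" * 500

for func in (by_split, by_word_regex, by_single_regex):
    # Best of 3 repeats of 10 calls each, as in the script above.
    best = min(timeit.repeat(lambda: func(text), number=10, repeat=3))
    print(f"{func.__name__}: {best:.4f} seconds")
```

Timings will vary with the interpreter version, platform, and input text, so the numbers from this sketch should not be expected to match the ones reported above.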

I too prefer string methods. But, here's a nitpick. I changed the definition of zx_re_find(text) and it's 4x faster than pure string method: def zx_re_find(text): pat = re.compile('(\b\w*z\w*\b)') for word in pat.findall(text): if 'x' in word: print word — Steven Rumbalski, Oct 18 '10 at 21:25
@Steven I have updated my answer to include both your suggested solution in the comment, and the solution you provided as an answer, and did not obtain 4X performance by any regular expression compared to the string method. For me, the RE solutions still trail behind. What text did you use to test your performance? — gotgenes, Oct 18 '10 at 22:33
@gotgenes I used the same plaintext copy of Moby Dick. I used python 2.7 on Windows XP on (hmm.. forgot chip in my work laptop). I do recall the first 3 digits of the timings 0.311 for string and 0.088 for regex (not really 4x, but close). I maintain that if the requirements were any more complicated, the regex would gain in simplicity and performance. — Steven Rumbalski, Oct 18 '10 at 23:47
@gotgenes Also, there would be some easy ways to /try/ to speed up the string methods approach, namely to test line by line for existence of 'x' and 'z' (they are, after all, infrequent letters), and then split to words from there. — Steven Rumbalski, Oct 18 '10 at 23:50
@Steven Perhaps this regular expression speedup is platform-specific? For example, http://bugs.python.org/issue8064 — gotgenes, Oct 19 '10 at 00:14
@gotgenes Whoops. My implementation had a bug. So my 4x speedup was entirely invalid. — Steven Rumbalski, Oct 19 '10 at 14:32
@gotgenes I did attempt the string methods optimization I mentioned and got a 6x speedup. Optimization was: Iterate by line. If line contains x and z, iterate by word. If word contains x and z, print it. But this is all moot, your original point stands that "the string methods-based approach is both simpler and faster." It was a fun exercise. — Steven Rumbalski, Oct 19 '10 at 14:36
@Steven I feel it was fun, too. As Python is an interpreted language, it's not always obvious which approach will give the best performance. (I actually thought one of the regexp approaches could win.) It's always good to re-visit performance empirically, and I like opportunities to use the nice `timeit` module. — gotgenes, Oct 19 '10 at 15:31
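
The line-by-line optimization described in the comments above (iterate by line; if the line contains x and z, iterate by word; if the word contains x and z, print it) could look roughly like this in Python 3. The function name and sample text are illustrative, and the sketch yields words rather than printing them; no code for this was posted in the thread.

```python
def xz_words_line_prefilter(lines):
    """Yield words containing both 'x' and 'z', skipping lines that lack either letter."""
    for line in lines:
        # Cheap substring checks reject most lines before any splitting happens.
        if "x" in line and "z" in line:
            for word in line.split():
                if "x" in word and "z" in word:
                    yield word


sample = """no match on this line
x here and z there but never together
the zax and the xyzzy share a line
"""
print(list(xz_words_line_prefilter(sample.splitlines())))  # ['zax', 'xyzzy']
```

Because the function accepts any iterable of lines, an open file object can be passed directly, which also avoids reading the whole file into memory.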

score 2 · Answer 5 · answered Oct 18 '10 at 20:40

I do not know the performance of this generator, but for me this is the way:

from __future__ import print_function
import string

bookfile = '11.txt' # Alice in Wonderland
hunted = 'az' # in your case xz but there is none of those in this book

with open(bookfile) as thebook:
    # read text of book and split from white space
    print('\n'.join(set(word.lower().strip(string.punctuation)
                    for word in thebook.read().split()
                    if all(c in word.lower() for c in hunted))))
""" Output:
zealand
crazy
grazed
lizard's
organized
lazy
zigzag
lizard
lazily
gazing
"""

score 0 · Answer 6 · answered Oct 18 '10 at 19:57

Sounds like a job for Regular Expressions. Read that and try it out. If you run into problems, update your question and we can help you with the specifics.

Brad Mace

score 0 · Answer 7 · answered Oct 18 '10 at 20:07

>>> import re
>>> print re.findall(r'(\w*x\w*z\w*|\w*z\w*x\w*)', 'axbzc azb axb abc axzb')
['axbzc', 'axzb']
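
The alternation above has to list every ordering of the required letters, which grows quickly beyond two characters. One possible alternative, offered as a sketch rather than as anything proposed in this thread, uses one lookahead per required character. The helper name `build_pattern` is hypothetical, and the sketch assumes the required characters are word characters:

```python
import re


def build_pattern(required):
    """Compile a regex matching whole words that contain every character in `required`."""
    # One lookahead per required character; each checks that the character
    # occurs somewhere in the run of word characters starting at this boundary.
    lookaheads = "".join(r"(?=\w*" + re.escape(ch) + ")" for ch in required)
    return re.compile(r"\b" + lookaheads + r"\w+\b")


pattern = build_pattern("xz")
print(pattern.findall("axbzc azb axb abc axzb"))  # ['axbzc', 'axzb']
```

Since the pattern contains no capturing groups, `findall` returns the full matched words.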

Paweł Nadolski
