I have been given a basic text file, and I need to use regex in Python to extract the words on each line and print the number of words per line.

Text File Example:

I have a dog.
She is small and cute,
and likes to play with other dogs.

Example Output:

Line 1: 4
Line 2: 5
Line 3: 7

Any help would be appreciated!

Nir Alfasi
Zoey
  • Please add the code, that you have written so far – Vipin Kumar Nov 21 '17 at 17:44
  • One thing to keep in mind is that the English language is not always this nice. Is _Myers-Briggs_ one word or two? Is _www.website.com_ one word? Word count machines are something where you can get as complicated as you desire. If you'd like to keep it simple, you won't need regex at all, just `str.split()`. – Brad Solomon Nov 21 '17 at 17:44
  • split by space, `sentence.split()` - it should do the trick – Nir Alfasi Nov 21 '17 at 17:45

4 Answers

You can try splitting each line into words and counting them:

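with open('input_file_name.txt') as input_file:
    # enumerate() numbers the lines, starting from 1
    for line_number, line in enumerate(input_file, start=1):
        # split() with no argument splits on any run of whitespace,
        # so a blank line counts as 0 words
        print('Line {}: {}'.format(line_number, len(line.split())))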
Ron
with open(path_to_text_file, "r") as f:  # the file is closed automatically
    counter = 1
    for line in f:  # read the file line by line
        # split() with no argument splits on runs of whitespace, so repeated
        # spaces and the trailing newline do not inflate the count
        print("Line %d: %d" % (counter, len(line.split())))
        counter += 1
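
The difference between splitting on a single space and splitting on any whitespace is easy to check. The following is a minimal sketch; the helper names `count_by_space` and `count_by_whitespace` and the sample strings are made up for illustration and are not part of the answer above.

```python
def count_by_space(text):
    """Count the pieces produced by splitting on every single space character."""
    return len(text.split(" "))


def count_by_whitespace(text):
    """Count the pieces produced by splitting on runs of any whitespace."""
    return len(text.split())


# Hypothetical sample lines: a normal line, one with a double space, a blank line.
samples = ["I have a dog.\n", "two  spaces here\n", "\n"]

for text in samples:
    print(repr(text), count_by_space(text), count_by_whitespace(text))
```

With a double space, `split(" ")` produces an extra empty string, and for a blank line it still returns one element (the newline itself), so only the argument-less `split()` gives the expected counts in those cases.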
Roopak A Nelliat

You could try awk, which splits on runs of whitespace by default:

cat <<EOT | awk '{print NF}'
> I have a dog.
> She is small and cute,
> and likes to play with other dogs.
> EOT
4
5
7

NF is an awk variable which is always set to the number of fields in the current record.

Cole Tierney

This very intuitive regex might help:

\b\w+\b

It matches all the word characters between word boundaries. You just need to count how many matches there are.
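
Counting the matches line by line might look like the sketch below. The function name `words_per_line` and the in-memory `sample` list are assumptions made for illustration; the list stands in for the lines of the text file from the question.

```python
import re

# The pattern from this answer, compiled once for reuse.
WORD = re.compile(r"\b\w+\b")


def words_per_line(lines):
    """Return a list with the number of regex word matches on each line."""
    return [len(WORD.findall(line)) for line in lines]


# Stand-in for the lines of the example text file.
sample = [
    "I have a dog.",
    "She is small and cute,",
    "and likes to play with other dogs.",
]

for number, count in enumerate(words_per_line(sample), start=1):
    print("Line {}: {}".format(number, count))
```

Because iterating over an open file yields one line at a time, an open file object can be passed to `words_per_line` in place of the list.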

If you want to count words with hyphens (or any other characters) as one word, put \w in a character class together with those characters, keeping the + quantifier:

\b[\w\-]+\b

or

\b[\w\-'.]+\b

etc.

You get the idea.
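
To see how the choice of character class changes the count, here is a small sketch. The test sentence, the `patterns` dictionary, and the `count_matches` helper are invented for this comparison; the patterns use the `+` quantifier so that each match covers a whole word.

```python
import re

# Hypothetical sentence containing a hyphen, an apostrophe, and a dotted name.
text = "Myers-Briggs isn't listed on www.website.com"

patterns = {
    "word characters only": r"\b\w+\b",
    "plus hyphen": r"\b[\w\-]+\b",
    "plus hyphen, apostrophe, dot": r"\b[\w\-'.]+\b",
}


def count_matches(pattern, s):
    """Return how many non-overlapping matches of pattern occur in s."""
    return len(re.findall(pattern, s))


for name, pattern in patterns.items():
    print("{}: {}".format(name, count_matches(pattern, text)))
```

On this sentence the three patterns report 9, 8, and 5 matches respectively, so which one is "right" depends on what should count as a single word.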

Sweeper
  • This pulls all the words in the file, but I need to count the words within a line. There is nothing to demarcate the end of the line in the output. – Zoey Nov 22 '17 at 02:29
  • @Zoey Refer to Roopak A Nelliat's answer if you don't know how to read the file line by line. – Sweeper Nov 22 '17 at 06:33