Python Text file statistics

Question

I am trying to use Python to read a short text file, clean it up by removing punctuation (: , . ! ?), and find the number of lines and words. The file I created has 3 lines, but after I removed the punctuation it shows there are 5 lines. What did I do wrong? Please help. Here is my code:

word_count = 0
line_count = 0
with open('book.txt','r') as file:
    data = file.read()
    for char in ': , . ! ?':
        data = data.replace(char,' ')
    wordslist = data.split()
    for line in wordslist:
        line_count += 1
        word_count += len(wordslist)
print(word_count,line_count)

Matt Coubrough · Answer 1 · 2018-05-24T05:04:43.977

There are a few problems with your code.

Specifically, split() with no arguments splits a string on any whitespace rather than on line boundaries, so it returns words, not lines. splitlines() splits on line boundaries.
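
To make the difference concrete, here is a minimal sketch on a made-up three-line string (the sample text is an assumption, not the asker's file):

```python
# Made-up sample text: three lines, seven words.
text = "Hello there, world!\nHow are you?\nFine.\n"

words = text.split()        # splits on any run of whitespace, including newlines
lines = text.splitlines()   # splits on line boundaries only

print(len(words))  # 7
print(len(lines))  # 3
```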

Additionally your code:

word_count += len(wordslist)

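is adding the full length of wordslist to the word count once for each element of wordslist. With n words in the file, word_count ends up as n × n and line_count as n (the number of words, not lines). This is almost certainly not what you want!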
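
A small sketch of that loop, using a hypothetical five-word list in place of the result of data.split():

```python
# Hypothetical word list standing in for data.split() on a small file.
wordslist = ["one", "two", "three", "four", "five"]

word_count = 0
line_count = 0
for line in wordslist:            # each "line" here is really a single word
    line_count += 1
    word_count += len(wordslist)  # adds 5 on every pass

print(word_count, line_count)  # 25 5
```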

Also note that your code:

for char in ': , . ! ?': 
    data = data.replace(char,' ')

is replacing each character from the supplied string (': , . ! ?') with a space. However, because that string itself contains spaces, you are needlessly replacing every space in data with a space 4 times over. It won't change the results, but it makes your code less efficient.
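
As an illustration, the string being looped over has 9 characters, 4 of which are spaces. The second half of this sketch shows a one-pass alternative using str.maketrans and str.translate. That is a different technique from the replace() loop in this answer, and the sample sentence is made up:

```python
chars = ': , . ! ?'
print(len(chars), chars.count(' '))  # 9 4

# One-pass alternative: map each punctuation character to a space.
table = str.maketrans(':,.!?', ' ' * 5)   # both arguments must have equal length
sample = "Wait: one, two. Go! Ready?"      # made-up sample
cleaned = sample.translate(table)
print(cleaned.split())  # ['Wait', 'one', 'two', 'Go', 'Ready']
```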

Instead you could achieve more correct results with code like this:

with open('book.txt','r') as file:
    data = file.read()
    for char in ':,.!?':
        data = data.replace(char,' ')
    word_count = len(data.split())      #count of words separated by whitespace
    line_count = len(data.splitlines()) #count of lines in data
print(word_count,line_count)

Addendum

It was also asked in comments how to get the character count. Assuming that the character count should count all characters that are not whitespace (tabs, newlines etc) or in the list of special characters, then it could be done with regular expressions:

import re
# original code that stripped out punctuation goes here
chars_only = re.sub(r"\s+", "", data, flags=re.UNICODE)
char_count = len(chars_only)

re.sub performs a regular expression substitution: it replaces every match of the expression r"\s+" (one or more consecutive whitespace characters) with the second argument, an empty string in this case.
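
A rough comparison on a made-up string, showing that replacing only spaces leaves tabs and newlines in the count while the regular expression removes all three:

```python
import re

data = "one two\tthree\nfour  five\n"   # made-up sample with a tab, newlines and a double space

spaces_removed = data.replace(' ', '')       # removes spaces only
all_ws_removed = re.sub(r"\s+", "", data)    # removes spaces, tabs and newlines

print(len(spaces_removed))  # 22
print(len(all_ws_removed))  # 19
```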

However, it should be noted that this char_count would include any punctuation characters that aren't in the original list of special punctuation characters (such as apostrophes).
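
If every punctuation character should be excluded, one possible approach (an assumption about what is wanted, not part of the original code) is to delete everything in string.punctuation before removing whitespace. Note that string.punctuation covers ASCII punctuation only, so typographic quotes and similar characters would still be counted. The sample string is made up:

```python
import re
import string

data = "Don't stop -- it's fine (really)!\n"   # made-up sample

# Remove every ASCII punctuation character, then every whitespace character.
no_punct = data.translate(str.maketrans('', '', string.punctuation))
chars_only = re.sub(r"\s+", "", no_punct)

print(chars_only)       # Dontstopitsfinereally
print(len(chars_only))  # 21
```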

That would be another question entirely, but you'd have to remove all *whitespace* from `data` and then take its length. If the whitespace is *only spaces* and not newlines, tabs etc. then you could just use `len(data.replace(' ',''))`. However, removing all whitespace including tabs, newlines etc. would require regular expressions, as in this answer: https://stackoverflow.com/a/28607213/3651800 – Matt Coubrough, May 24 '18 at 04:32
I have added an example of how regular expressions could be used to get a char only count - 'regex' may seem a bit scary at first, and there are non-regex ways of doing this, but learning regex is definitely worthwhile for any software developer as it can often reduce dozens of lines of code to a single line. – Matt Coubrough, May 24 '18 at 05:08

score 0 · Answer 2 · 2018-05-24T04:24:42.080


You can try this:

word_count = 0
line_count = 0
with open('sample1.txt','r') as file:
    data = file.readlines()
for line in data:
    for char in '?,!:.':
        line = line.replace(char, ' ')  # swap each punctuation character for a space
    line_count += 1
    word_count += len(line.split())     # split on any whitespace
print(word_count, line_count)

Explanation:

readlines() reads all the lines of the file and returns them as a list.

    for char in '?,!:.':
        line = line.replace(char, ' ')

For each line, this replaces every occurrence of the characters ? , ! : . with a space. replace() is used rather than strip() because strip() only removes characters from the two ends of a string, not from the middle.

    line_count += 1
    word_count += len(line.split())

This adds 1 to line_count for each line and adds the number of words on that line to word_count. split() with no argument is used rather than split(' ') so that repeated spaces and the trailing newline do not produce empty or malformed words.
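
Two string-method details matter for code like this: strip() only trims characters from the ends of a string, and split(' ') behaves differently from split() when a line has repeated spaces or a trailing newline. A minimal sketch on a made-up line:

```python
line = "Hello, world!  How are you?\n"   # made-up sample line

# strip() only trims the listed characters from the two ends of the string.
print(repr(line.strip('?,!\n')))   # 'Hello, world!  How are you'

# split(' ') keeps empty strings for repeated spaces and leaves '\n' attached.
print(line.split(' '))   # ['Hello,', 'world!', '', 'How', 'are', 'you?\n']
print(line.split())      # ['Hello,', 'world!', 'How', 'are', 'you?']
```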

edited May 24 '18 at 04:24

answered May 24 '18 at 04:05

@Chris: Is this what you are expecting ? – May 24 '18 at 04:07
this won't solve the problem he's asking. `split()` function is still where things go wrong. – Ramy M. Mousa May 24 '18 at 04:10
`word_count += len(wordslist)` is also problematic – Matt Coubrough May 24 '18 at 04:11
@MattCoubrough: Thanks updated now. Please confirm it – May 24 '18 at 04:19
@Chris: This is tested in my test env. This worked well. – May 24 '18 at 04:26

Ramy M. Mousa · Answer 3 · 2018-05-24T04:33:46.870


.split() with no arguments splits on any whitespace, so it won't give you lines as you might expect. You need .splitlines().

word_count = 0
line_count = 0
with open('books.txt','r') as file:
    data = file.read()
    for char in '?,!:.':
        data = data.replace(char, ' ')  # replace each punctuation character with a space
    #Here is the part you need
    wordslist = data.splitlines()
    for line in wordslist:
        line_count += 1
        word_count += len(line.split())
print(word_count, line_count)

Alternatively, .split('\n') does nearly the same job, but it returns an extra empty string at the end if the text ends with a newline.
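
A small comparison of splitlines() and split('\n') on made-up strings. The two agree except when the text ends with a newline or uses Windows-style \r\n line endings:

```python
data = "first line\nsecond line\nthird line\n"   # made-up sample ending in a newline

print(data.splitlines())   # ['first line', 'second line', 'third line']
print(data.split('\n'))    # ['first line', 'second line', 'third line', '']

windows = "a\r\nb\r\n"     # made-up sample with Windows-style line endings
print(windows.splitlines())  # ['a', 'b']
print(windows.split('\n'))   # ['a\r', 'b\r', '']
```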

edited May 24 '18 at 04:33

answered May 24 '18 at 04:09



this will not work. `word_count += len(wordslist)` is adding the length of the lines array to the number of words for each line – Matt Coubrough May 24 '18 at 04:12
