4

I have a small problem: I need to count words from the console in doc, docx, pptx, ppt, xls, xlsx, odt, pdf and similar files. Please don't suggest piping to wc -w or grep: they work only on text or console output and split only on spaces, while Japanese, Chinese, Arabic, Hindi and Hebrew use different delimiters, so the word count comes out wrong. I tried counting with these:

pdftotext file.pdf - | wc -w
/usr/local/bin/docx2txt.pl < file.docx | wc -w
/usr/local/bin/pptx2txt.pl < file.pptx | wc -w
antiword file.doc - | wc -w
antiword file.word - | wc -w

In some cases Microsoft Word and OpenOffice report 1000 words while these counters return 10 or 300 words if the language is Japanese, Chinese, Hindi, etc. With normal characters I have no issue: the biggest error is 3 less in some cases, which is "OK".
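To illustrate why the counts diverge, here is a minimal Python sketch of a counter that does not rely on spaces alone. It rests on one assumption that is not from the post: every Chinese ideograph or Japanese kana counts as one word, while other scripts are counted as whitespace-separated runs. Office suites may use different rules, so the numbers will not necessarily match Word or OpenOffice. The names `is_cjk` and `count_words` are made up for this example.

```python
import unicodedata


def is_cjk(ch):
    """Return True for ideographs and kana, which are written without spaces."""
    name = unicodedata.name(ch, "")
    return name.startswith(
        ("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH", "HIRAGANA", "KATAKANA")
    )


def count_words(text):
    """Count words, treating each CJK character as its own word (assumption)."""
    words = 0
    in_word = False
    for ch in text:
        if ch.isspace():
            # Whitespace ends the current word.
            in_word = False
        elif unicodedata.category(ch).startswith("P"):
            # Punctuation neither starts nor ends a word ("don't" stays one word).
            continue
        elif is_cjk(ch):
            # Assumption: one CJK character equals one word.
            words += 1
            in_word = False
        elif not in_word:
            # First character of a whitespace-separated word in any other script.
            words += 1
            in_word = True
    return words
```

Fed with the output of pdftotext or docx2txt, something like this gives a count that no longer collapses a whole Japanese sentence into a single "word".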

I tried to convert with soffice or OpenOffice and then run wc -w on the result, but I can't even get the conversion to work:

soffice --headless --nofirststartwizard --accept=socket,host=127.0.0.1,port=8100; --convert-to pdf some.pdf /var/www/domains/vocabridge.com/devel/temp_files/23/0/东京_1000_words_Docx.docx 

OR

 openoffice.org  --headless  --convert-to  ........

OR

openoffice.org3 --invisible 

So if anyone knows a way to count correctly, or to display document statistics with OpenOffice or anything else from the Linux console, please share it.

thanks.

ddjikic

5 Answers

3

If you have Microsoft Word (and Windows, obviously) you can write a VBA macro or if you want to run straight from the command line you can write a VBScript script with something like the following:

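' Object references must be assigned with Set in VBScript
Set wordApp = CreateObject("Word.Application")
Set doc = wordApp.Documents.Open("<absolute path to your document>") ' placeholder path
docWordCount = doc.Words.Count
' Rinse and repeat...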

If you have OpenOffice.org/LibreOffice you have similar (but more) options. If you want to stay in the office app and run a macro you can probably do that. I don't know the StarBasic API well enough to tell you how but I can give you the alternative: creating a Python script to get the word count from the command line. Roughly speaking, you do the following:

Yawar
1

Just building on what @Yawar wrote, here are more explicit steps for getting a word count with MS Word from the console.

I also use the more accurate Range.ComputeStatistics(wdStatisticWords) instead of the Words property. See here for more info: https://support.microsoft.com/en-za/help/291447/word-count-appears-inaccurate-when-you-use-the-vba-words-property

  1. Make a script called wc.vbs and then put this in it:

    ' Word's named constants are not visible to VBScript, so define the one we need
    Const wdStatisticWords = 0
    Set word = CreateObject("Word.Application")
    word.Visible = False
    Set doc = word.Documents.Open("<replace with absolute path to your .docx/.pdf>")
    docWordCount = doc.Range.ComputeStatistics(wdStatisticWords)
    ' Close without saving, then shut Word down before printing the result
    doc.Close False
    word.Quit
    WScript.Echo docWordCount & " words"
    
  2. Open PowerShell in the directory where you saved wc.vbs and run cscript .\wc.vbs, and you'll get back the word count :)

nmu
0

I think this may do what you are aiming for

# Continuously updating word count
import unohelper, uno, os, time
from com.sun.star.i18n.WordType import WORD_COUNT
from com.sun.star.i18n import Boundary
from com.sun.star.lang import Locale
from com.sun.star.awt import XTopWindowListener

#socket = True
socket = False
localContext = uno.getComponentContext()

if socket:
    resolver = localContext.ServiceManager.createInstanceWithContext('com.sun.star.bridge.UnoUrlResolver', localContext)
    ctx = resolver.resolve('uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext')
else: ctx = localContext

smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext('com.sun.star.frame.Desktop', ctx)

waittime = 1 # seconds

def getWordCountGoal():
    doc = XSCRIPTCONTEXT.getDocument()
    retval = 0

    # Only if the field exists
    if doc.getTextFieldMasters().hasByName('com.sun.star.text.FieldMaster.User.WordCountGoal'):
        # Get the field
        wordcountgoal = doc.getTextFieldMasters().getByName('com.sun.star.text.FieldMaster.User.WordCountGoal')
        retval = wordcountgoal.Content

    return retval

goal = getWordCountGoal()

def setWordCountGoal(goal):
    doc = XSCRIPTCONTEXT.getDocument()

    if doc.getTextFieldMasters().hasByName('com.sun.star.text.FieldMaster.User.WordCountGoal'):
        wordcountgoal = doc.getTextFieldMasters().getByName('com.sun.star.text.FieldMaster.User.WordCountGoal')
        wordcountgoal.Content = goal

    # Refresh the field if inserted in the document from Insert > Fields >
    # Other... > Variables > Userdefined fields
    doc.TextFields.refresh()

def printOut(txt):
    if socket: print(txt)
    else:
        model = desktop.getCurrentComponent()
        text = model.Text
        cursor = text.createTextCursorByRange(text.getEnd())
        text.insertString(cursor, txt + '\r', 0)

def hotCount(st):
    '''Counts the number of words in a string.

    ARGUMENTS:

    str st: count the number of words in this string

    RETURNS:

    int: the number of words in st'''
    startpos = 0
    nextwd = Boundary()
    lc = Locale()
    lc.Language = 'en'
    numwords = 1
    mystartpos = 1
    brk = smgr.createInstanceWithContext('com.sun.star.i18n.BreakIterator', ctx)
    nextwd = brk.nextWord(st, startpos, lc, WORD_COUNT)
    while nextwd.startPos != nextwd.endPos:
        numwords += 1
        nw = nextwd.startPos
        nextwd = brk.nextWord(st, nw, lc, WORD_COUNT)

    return numwords

def updateCount(wordCountModel, percentModel):
    '''Updates the GUI.
    Updates the word count and the percentage completed in the GUI. If some
    text of more than one word is selected (including in multiple selections by
    holding down the Ctrl/Cmd key), it updates the GUI based on the selection;
    if not, on the whole document.'''

    model = desktop.getCurrentComponent()
    try:
        if not model.supportsService('com.sun.star.text.TextDocument'):
            return
    except AttributeError: return

    sel = model.getCurrentSelection()
    try: selcount = sel.getCount()
    except AttributeError: return

    if selcount == 1 and sel.getByIndex(0).getString() == '':
        selcount = 0

    selwords = 0
    for nsel in range(selcount):
        thisrange = sel.getByIndex(nsel)
        atext = thisrange.getString()
        selwords += hotCount(atext)

    if selwords > 1: wc = selwords
    else:
        try: wc = model.WordCount
        except AttributeError: return
    wordCountModel.Label = str(wc)

    if goal != 0:
        pc_text =  100 * (wc / float(goal))
        #pc_text = '(%.2f percent)' % (100 * (wc / float(goal)))
        percentModel.ProgressValue = pc_text
    else:
        percentModel.ProgressValue = 0

# This is the user interface bit. It looks more or less like this:

###############################
# Word Count            _ o x #
###############################
#            _____            #
#     451 /  |500|            #
#            -----            #
# ___________________________ #
# |##############           | #
# --------------------------- #
###############################

# The boxed `500' is the text entry box.

class WindowClosingListener(unohelper.Base, XTopWindowListener):
    def __init__(self):
        global keepGoing

        keepGoing = True
    def windowClosing(self, e):
        global keepGoing

        keepGoing = False
        setWordCountGoal(goal)
        e.Source.setVisible(False)

def addControl(controlType, dlgModel, x, y, width, height, label, name = None):
    control = dlgModel.createInstance(controlType)
    control.PositionX = x
    control.PositionY = y
    control.Width = width
    control.Height = height
    if controlType == 'com.sun.star.awt.UnoControlFixedTextModel':
        control.Label = label
    elif controlType == 'com.sun.star.awt.UnoControlEditModel':
        control.Text = label
    elif controlType == 'com.sun.star.awt.UnoControlProgressBarModel':
        control.ProgressValue = label

    if name:
        control.Name = name
        dlgModel.insertByName(name, control)
    else:
        control.Name = 'unnamed'
        dlgModel.insertByName('unnamed', control)

    return control

def loopTheLoop(goalModel, wordCountModel, percentModel):
    global goal

    while keepGoing:
        try: goal = int(goalModel.Text)
        except: goal = 0
        updateCount(wordCountModel, percentModel)
        time.sleep(waittime)

if not socket:
    import threading
    class UpdaterThread(threading.Thread):
        def __init__(self, goalModel, wordCountModel, percentModel):
            threading.Thread.__init__(self)

            self.goalModel = goalModel
            self.wordCountModel = wordCountModel
            self.percentModel = percentModel

        def run(self):
            loopTheLoop(self.goalModel, self.wordCountModel, self.percentModel)

def wordCount(arg = None):
    '''Displays a continuously updating word count.'''
    dialogModel = smgr.createInstanceWithContext('com.sun.star.awt.UnoControlDialogModel', ctx)

    dialogModel.PositionX = XSCRIPTCONTEXT.getDocument().CurrentController.Frame.ContainerWindow.PosSize.Width / 2.2 - 105
    dialogModel.Width = 100
    dialogModel.Height = 30
    dialogModel.Title = 'Word Count'

    lblWc = addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 6, 2, 25, 14, '', 'lblWc')
    lblWc.Align = 2 # Align right
    addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 33, 2, 10, 14, ' / ')
    txtGoal = addControl('com.sun.star.awt.UnoControlEditModel', dialogModel, 45, 1, 25, 12, '', 'txtGoal')
    txtGoal.Text = goal

    #addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 6, 25, 50, 14, '(percent)', 'lblPercent')

    ProgressBar = addControl('com.sun.star.awt.UnoControlProgressBarModel', dialogModel, 6, 15, 88, 10,'' , 'lblPercent')
    ProgressBar.ProgressValueMin = 0
    ProgressBar.ProgressValueMax =100
    #ProgressBar.Border = 2
    #ProgressBar.BorderColor = 255
    #ProgressBar.FillColor = 255
    #ProgressBar.BackgroundColor = 255

    addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 124, 2, 12, 14, '', 'lblMinus')

    controlContainer = smgr.createInstanceWithContext('com.sun.star.awt.UnoControlDialog', ctx)
    controlContainer.setModel(dialogModel)

    controlContainer.addTopWindowListener(WindowClosingListener())
    controlContainer.setVisible(True)
    goalModel = controlContainer.getControl('txtGoal').getModel()
    wordCountModel = controlContainer.getControl('lblWc').getModel()
    percentModel = controlContainer.getControl('lblPercent').getModel()
    ProgressBar.ProgressValue = percentModel.ProgressValue

    if socket:
        loopTheLoop(goalModel, wordCountModel, percentModel)
    else:
        uthread = UpdaterThread(goalModel, wordCountModel, percentModel)
        uthread.start()

keepGoing = True
if socket:
    wordCount()
else:
    g_exportedScripts = wordCount,

Link for more info

https://superuser.com/questions/529446/running-word-count-in-openoffice-writer

Hope this helps. Regards, Tom

EDIT: Then I found this:

http://forum.openoffice.org/en/forum/viewtopic.php?f=7&t=22555

06needhamt
  • Can you do it in applications other than Writer, and from the console only? – ddjikic Mar 09 '13 at 18:01
  • This only works in Writer as far as I know, sorry, but one possible solution is to copy the text into Writer and count it that way. – 06needhamt Mar 09 '13 at 18:22
  • 1
    The above code is outdated. I've since simplified. You can find the latest at https://bitbucket.org/yawaramin/oo.o-live-word-count/src/tip/wc.py -- but see my answer attempt. – Yawar Apr 14 '13 at 07:19
0

wc understands Unicode and uses the system's iswspace function to decide whether a Unicode character is whitespace. "The iswspace() function tests whether wc is a wide-character code representing a character of class space in the program's current locale." So wc -w should be able to count whitespace-separated words correctly if your locale (LC_CTYPE) is configured correctly.
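As a rough analogy, and not the actual wc implementation, Python's str.split() with no arguments also splits on any Unicode whitespace, so it can show what a locale-aware whitespace counter does and does not handle. The function name `whitespace_word_count` is invented for this sketch.

```python
def whitespace_word_count(text):
    """Count runs of non-whitespace characters, as a whitespace-based counter would."""
    # str.split() with no arguments splits on every Unicode whitespace character,
    # including U+3000 IDEOGRAPHIC SPACE, and drops empty strings.
    return len(text.split())
```

With this, "one two three" gives 3 and "東京\u3000大阪" gives 2, but a sentence written without any spaces, such as "東京は大きい都市です", still comes back as a single word. A correctly configured locale helps with unusual space characters, not with scripts that do not separate words by whitespace.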

The source code of the wc program

The manual page for the iswspace function

Ark-kun
0

I found the answer: create a service.

#!/bin/sh
#
# chkconfig: 345 99 01
#
# description: your script is a test service
#

(while sleep 1; do
  for file in pathwithfiles/in/*; do
    # Skip the unexpanded pattern when the folder is empty
    [ -f "$file" ] || continue
    libreoffice --headless --convert-to pdf "$file" --outdir pathwithfiles/out
    rm "$file"
  done
done) &
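For anyone who would rather drive the conversion from Python than from a shell loop, one pass of the same idea could be sketched like this. The function names and the directory arguments are hypothetical; the only thing taken from the script above is the libreoffice --headless --convert-to pdf --outdir invocation, and the binary name is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, outdir, binary="libreoffice"):
    """Build the argument list for a headless conversion of src to PDF."""
    # Passing a list (not a shell string) keeps file names with spaces
    # or non-ASCII characters intact.
    return [binary, "--headless", "--convert-to", "pdf",
            "--outdir", str(outdir), str(src)]


def convert_pending(indir, outdir):
    """Convert every file waiting in indir, then remove the original."""
    for src in sorted(Path(indir).iterdir()):
        if src.is_file():
            # check=True raises if LibreOffice exits with an error,
            # so a failed conversion does not delete the source file.
            subprocess.run(build_convert_command(src, outdir), check=True)
            src.unlink()
```

Calling convert_pending("pathwithfiles/in", "pathwithfiles/out") once per second would mirror the loop in the service script.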

Then the PHP script that I needed counts everything:

        $ext = pathinfo($absolute_file_path, PATHINFO_EXTENSION);
        if ($ext !== 'txt' && $ext !== 'pdf') {
            // Convert to pdf: drop a copy into the folder watched by the service
            $tb = time() . mt_rand();
            $tempfile = 'locationofpdfs/in/' . $tb . '.' . $ext;
            copy($absolute_file_path, $tempfile);
            $absolute_file_path = 'locationofpdfs/out/' . $tb . '.pdf';
            $ext = 'pdf';
            // Wait until the service has written the converted file
            while (!is_file($absolute_file_path)) sleep(1);
        }
        if ($ext !== 'txt') {
            // Convert to txt
            $tempfile = tempnam(sys_get_temp_dir(), '');
            shell_exec('pdftotext ' . escapeshellarg($absolute_file_path) . ' ' . escapeshellarg($tempfile));
            $absolute_file_path = $tempfile;
            $ext = 'txt';
        }
        if ($ext === 'txt') {
            // Separators: whitespace and common punctuation
            $seq = '/[\s\.,;:!\? ]+/mu';
            $plain = file_get_contents($absolute_file_path);
            // Strip {{{...}}} blocks before counting
            $plain = preg_replace('#\{{{.*?\}}}#su', "", $plain);
            $str = preg_replace($seq, '', $plain);
            $chars = count(preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY));
            $words = count(preg_split($seq, $plain, -1, PREG_SPLIT_NO_EMPTY));
            if ($words === 0) return $chars;
            // More than 10 characters per "word" suggests text written without
            // spaces (e.g. Japanese or Chinese), so count characters instead
            if ($chars / $words > 10) $words = $chars;
            return $words;
        }
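The counting rule at the end of that PHP fragment can be tried on its own. Below is a rough Python port, offered as a sketch and not as the exact script: the function name `heuristic_word_count` is made up, and the separator class leaves out the trailing space character of the PHP pattern because \s already covers whitespace.

```python
import re

# Same idea as $seq in the PHP: whitespace and common Western punctuation.
SEPARATORS = re.compile(r"[\s.,;:!?]+")
# {{{...}}} blocks are stripped before counting, as in the PHP.
BRACED = re.compile(r"\{\{\{.*?\}\}\}", re.S)


def heuristic_word_count(plain):
    """Return a word count, falling back to characters for unsegmented text."""
    plain = BRACED.sub("", plain)
    # Characters left once all separators are removed (counted as code points).
    chars = len(SEPARATORS.sub("", plain))
    # Non-empty pieces between separators.
    words = len([w for w in SEPARATORS.split(plain) if w])
    if words == 0:
        return chars
    # More than 10 characters per "word" suggests a script without spaces,
    # so each character is counted as a word instead.
    if chars / words > 10:
        return chars
    return words
```

The threshold of 10 is the one from the PHP. Short unsegmented strings of 10 characters or fewer will still be reported as a single word, which is a limitation of the heuristic itself.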
ddjikic