
How do I generate bigrams using LibreOffice Basic?

I can do that in Python like this...

import nltk, sys
from nltk.tokenize import word_tokenize
sys.stdout = open("mygram1.txt", "w")
with open("mytext.txt") as f:
    # Tokenize each line and print its bigrams, one per output line.
    for text in f:
        tokens = nltk.word_tokenize(text)
        bigrm = nltk.bigrams(tokens)
        print(*map(' '.join, bigrm), sep='\n')

But I need a macro that I can run in LibreOffice Writer. I do not want to use Python.


Update:

Just like bigrams, NLTK has a trigrams function that I can call as nltk.trigrams. And if I need four- or five-grams, there is everygrams!

from nltk import everygrams
import nltk, sys
from nltk.tokenize import word_tokenize
sys.stdout = open("myfourgram1.txt", "w")
with open("/home/ubuntu/mytext.txt") as f:
  for text in f:
      tokens = nltk.word_tokenize(text)
      for i in list(everygrams(tokens, 4, 4)):
          print((" ".join(i)))

Is this possible in LibreOffice Basic?

shantanuo
  • You could traverse the document word by word using Andrew Pitonyak's Basic macros, and then store the results in some kind of associative array, such as a Collection object or EnumerableMap. IMHO you're making things harder by not using Python though. – Jim K May 14 '22 at 14:15

1 Answer

You could replicate the behaviour of your Python code by recycling the code in my answer to your previous question (Can you Print the wavy lines generated by Spell check in writer?). First strip out everything related to spell checking, generating alternatives and sorting, which makes it considerably shorter, then change the line that inserts the results into the new document so that it inserts pairs of words. Rather than having your input text in a .txt file, you would have to put it into a Writer document, and the results would appear in a new Writer document.

It should look something like the listing below, which also includes the subsidiary function IsWordSeparator().

Option Explicit

Sub ListBigrams

    Dim oSource As Object 
    oSource = ThisComponent

    Dim oSourceCursor As Object
    oSourceCursor = oSource.getText.createTextCursor()
    oSourceCursor.gotoStart(False)
    oSourceCursor.collapseToStart()

    Dim oDestination As Object
    oDestination = StarDesktop.loadComponentFromURL( "private:factory/swriter",  "_blank", 0, Array() )

    Dim oDestinationText as Object
    oDestinationText = oDestination.getText()

    Dim oDestinationCursor As Object
    oDestinationCursor = oDestinationText.createTextCursor()

    Dim s As String, sParagraph As String, sPreviousWord As String, sThisWord As String    
    Dim i as Long, j As Long, nWordStart As Long, nWordEnd As Long, nChar As Long
    Dim bFirst as Boolean
    
    sPreviousWord = ""
    bFirst = true

    Do
        oSourceCursor.gotoEndOfParagraph(True)
        sParagraph = oSourceCursor.getString() & " " 'It is necessary to add a space to the end of
        'the string otherwise the last word of the paragraph is not recognised.
        
        nWordStart = 1
        nWordEnd = 1
        
        For i = 1 to Len(sParagraph)
        
            nChar = ASC(Mid(sParagraph, i, 1))
            
            If IsWordSeparator(nChar) Then   '1

                If nWordEnd > nWordStart Then   '2

                    sThisWord = Mid(sParagraph, nWordStart, nWordEnd - nWordStart)

                    If bFirst Then
                        bFirst = False
                    Else
                        oDestinationText.insertString(oDestinationCursor, sPreviousWord & " " & sThisWord & Chr(13), False)
                    End If

                    sPreviousWord = sThisWord

                End If   '2
                nWordEnd = nWordEnd + 1
                nWordStart = nWordEnd
            Else
                nWordEnd = nWordEnd + 1
            End If    '1

        Next i

    Loop While oSourceCursor.gotoNextParagraph(False)

End Sub

'----------------------------------------------------------------------------

' OOME Listing 360. 
Function IsWordSeparator(iChar As Long) As Boolean

    ' Horizontal tab \t 9
    ' New line \n 10
    ' Carriage return \r 13
    ' Space   32
    ' Non-breaking space   160     

    Select Case iChar
    Case 9, 10, 13, 32, 160
        IsWordSeparator = True
    Case Else
        IsWordSeparator = False
    End Select    
End Function
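
The macro above remembers only one previous word, which is enough for bigrams. For the trigrams and four-grams mentioned in the question's update, the same idea should generalise to a sliding window holding the last n words. As a rough sketch of that logic only (in plain Python without NLTK, so it is easy to check before porting to Basic), the hypothetical helpers split_words and ngrams below split on the same separator characters as IsWordSeparator(). Unlike nltk.word_tokenize, this does not split off punctuation.

```python
from collections import deque

# Same separators as IsWordSeparator(): tab, LF, CR, space, non-breaking space.
SEPARATORS = {9, 10, 13, 32, 160}


def split_words(paragraph):
    """Split a paragraph into words by scanning it one character at a time."""
    words, current = [], []
    for ch in paragraph + " ":  # the trailing space flushes the last word
        if ord(ch) in SEPARATORS:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    return words


def ngrams(words, n):
    """Yield every run of n consecutive words as a space-joined string."""
    window = deque(maxlen=n)  # keeps only the last n words seen
    for word in words:
        window.append(word)
        if len(window) == n:
            yield " ".join(window)


if __name__ == "__main__":
    for gram in ngrams(split_words("the quick brown fox jumps"), 3):
        print(gram)
```

In Basic, the deque would presumably become a small array of the last n words that is shifted along each time a new word is found, with the joined window written to the destination document once it is full.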

Although this would be easier in Python, as Jim K suggested, the BASIC approach makes it easier to distribute the functionality to users, since they do not have to install Python and the NLTK library (which is not straightforward).

Howard Rudd