In Julia, how can I check whether a string is a meaningful English word? For example, I want to know whether "Hello" is meaningful. In Python, one can use the enchant or nltk packages (examples: [1], [2]). Is it possible to do this in Julia as well?

What I need is a function like this:

is_english("Hello")
>>>true

is_english("Hlo")
>>>false
# Because it has no meaning: there is no such word in the English vocabulary.

is_english("explicit")
>>>true

is_english("eeplicit")
>>>false

Here is what I've tried so far:
I have a dataset of frequent five-character English words (link to google drive), so I decided to add it to my question to make it easier to understand. This dataset is not adequate, because it contains only frequent five-character words rather than all meaningful English words of any length, but it is enough to show what I want:

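using CSV
using DataFrames

df = CSV.read("frequent_5_char_words.csv", DataFrame, skipto=2)

# Keep the lowercased words from column "0" in a Set for fast membership tests,
# instead of rebinding `df` to a plain vector.
words = Set(lowercase(item) for item in df[:, "0"])

function is_english(word::AbstractString)::Bool
    return lowercase(word) in words
end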

Then when I try these:

julia> is_english("Helo")
false

julia> is_english("Hello")
true

But I don't have a comprehensive dataset, so this isn't enough. Are there any Julia packages like the ones I mentioned above?

Shayan
  • 1
    It depends what you mean by _meaningful_ english word, but you should find what you need among these [Julia NLP Packages](https://juliapackages.com/c/nlp), for example [`WordNet.jl`](https://juliapackages.com/p/wordnet). – EricLavault Mar 13 '22 at 16:16
  • 1
    @EricLavault I mean checking whether the word is part of the English vocabulary or not. **I provided explicit examples in my question.** It's clear what I mean. – Shayan Mar 13 '22 at 17:04
  • 2
    No it's not. There is a gap between being explicit and throwing "_examples: [1],[2]_" pointing to poorly related posts : for example a stopword taken alone has no meaning, yet stopwords are part of a corpus (or should they according to your definition of _meaningful_?); should a misspelled word be considered as a _meaningful_ word as if it were correctly spelled (in which case it involves spellchecker rules)? Or don't use that word if you don't want to elaborate on what you consider meaningful or not with your own examples to bring some context, or if you don't know what it means. – EricLavault Mar 13 '22 at 23:52
  • You meant something like `is_english("Hello")` then `>>> True`? – Shayan Mar 14 '22 at 07:20
  • @fandak, exactly. But some people see this as not explicit :))))) – Shayan Mar 14 '22 at 07:27
  • 1
    Well, it was so explicit that @fandak also has to ask you about what _you meant_, bringing an example of his own... so clear that you eventually edit your post. Hopefully you deigned to provide more details to your question. Now it's clear. Still, you don't want to get the difference between _is_english_ and _is_meaningful_. – EricLavault Mar 14 '22 at 10:53
  • 2
    @EricLavault No! You're wrong about my question! Your comment made me doubt whether I had understood his point, so I decided to ask the questioner. Otherwise I would never have asked, because I got his point at a glance. Now I see he has updated the question with further information, parallel to his previous examples like [[1](https://stackoverflow.com/a/3789057/11747148)]. My example was the same as what he referenced in *[1]*! – Shayan Mar 14 '22 at 12:58

2 Answers

(not enough rep to post a comment!)

You can still use NLTK in Julia via PyCall. Or, since it seems you don't need an NLP tool but just a dictionary, you can use Wiktionary to look words up or to build the dataset.
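If a plain dictionary lookup is enough, a minimal sketch in pure Julia could look like the following. It assumes you already have a newline-delimited word list from some source; the names `load_vocabulary` and `vocab` are hypothetical helpers introduced only for this illustration, and the three-word sample stands in for a real list.

```julia
# Hypothetical helper: build a lookup set from a newline-delimited word list.
function load_vocabulary(io::IO)
    vocab = Set{String}()
    for line in eachline(io)
        word = lowercase(strip(line))
        # Skip blank lines; store everything else lowercased.
        isempty(word) || push!(vocab, word)
    end
    return vocab
end

# Convenience method: open a file path and delegate to the IO method.
load_vocabulary(path::AbstractString) = open(load_vocabulary, path)

# A word counts as English here only if it appears in the supplied vocabulary.
is_english(word::AbstractString, vocab::AbstractSet{<:AbstractString}) =
    lowercase(word) in vocab

# Tiny in-memory sample standing in for a real word list file (assumption).
sample = IOBuffer("Hello\nexplicit\nworld\n")
vocab = load_vocabulary(sample)

println(is_english("Hello", vocab))  # true
println(is_english("Hlo", vocab))    # false
```

With a real list you would call `load_vocabulary` with a file path instead of the `IOBuffer`. Many Unix-like systems ship a word list at `/usr/share/dict/words`, though that is not guaranteed to exist on your machine. The result is only as good as the list: a word missing from it is reported as not English.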

Isia S.
  • Can you please explain how I can use NLTK in Julia via PyCall? Can you provide a practical example? – Shayan Mar 15 '22 at 10:28

There is a fairly new package named LanguageDetect.jl. It does not return true/false, but a list of languages with probabilities. You could define something like:

using LanguageDetect: detect

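function is_english(text, threshold=0.8)
  # detect returns the candidate languages with their probabilities
  langs = detect(text)
  for lang in langs
    if lang.language == "en"
      return lang.probability >= threshold
    end
  end
  # English was not among the detected languages
  return false
end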



longemen3000