
I have a column in my data frame that contains text. An example row is shown below:

 DF$Title

 [1] "This is an example text which contains some numbers written in words such as one, two, three. The text one continues to text two and text three. I also have five hundred fifty dollars. But I am looking for five hundred thousands three hundred fourty seven more to invest into some stocks."

What I want to do is find all the words that are actually numbers, such as One, two, five hundred thousands, etc., and remove them from the text, so that the text above becomes:

 DF$CleanedTitle

 [1] "This is an example text which contains some numbers written in words such as. The text continues to text and text. I also have dollars. But I am looking for more to invest into some stocks."

I found this question really helpful, but it is about converting words into numbers, not removing them.

Is there a better alternative?

Ronak Shah
LeMarque
  • An idea: use that code and then run `gsub("\\s*\\d+", "", result)`. Well, it will remove all the digits. Your question is rather broad: what about `one hundred and nine`, `naught point two`, etc.? – Wiktor Stribiżew Jul 06 '18 at 07:55
  • @WiktorStribiżew Exactly, that is the difficulty. So I was thinking: if we have, for example, one hundred, change it to 100, then keep "and" and change nine to 9, so that we can remove the digits, and stopword removal will take care of "and", won't it? – LeMarque Jul 06 '18 at 08:02
  • Similarly, for `naught point two` we can convert point to a dot (`.`) wherever it appears and then two to 2. I think this will help, maybe. – LeMarque Jul 06 '18 at 08:05
  • If you plan to remove stopwords, you may do that as the first step. If you plan to protect existing numbers, protect them with, e.g., braces. Then convert spelled out numbers to digits. Remove the digits not inside braces. – Wiktor Stribiżew Jul 06 '18 at 08:05
  • Check the examples in `textclean::replace_number`. There are some examples on how to go from one to 1. But it is a slow process if you have big numbers, because first you have to create a big vector for every possible combination of numbers in your text (e.g. from 1 to 1.000.000, i.e. a vector of 1 million) and use that vector to replace the written numbers (or use it as a stopword list). – phiver Jul 06 '18 at 09:30

1 Answer


Here is an attempt. It's quite likely I've not thought of some words or phrases, but it gets the right answer on the question asker's input:

library(stringr)  # provides str_split()

number_word_remover <- function(phrase) {
    out <- c()

    # words we know for sure are numbers
    number_words <- c("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
                    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
                    "eighteen", "nineteen", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
                    "eighty", "ninety", "hundred", "thousand", "million", "billion", "trillion", "half", 
                    "quarter", "third", "fourth", "fifth", "sixth", "seventh", "eighth", "ninth", "tenth", 
                    "eleventh", "twelfth", "thirteenth", "fourteenth", "fifteenth", "sixteenth",
                    "seventeenth", "eighteenth", "nineteenth", "twentieth", "thirtieth", "fortieth",
                    "fiftieth", "sixtieth", "seventieth", "eightieth", "ninetieth", "hundredth",
                    "thousandth", "millionth", "billionth", "trillionth", "ones", "twos", "threes", "fours",
                    "fives", "sixes", "sevens", "eights", "nines", "tens", "elevens", "twelves", "thirteens",
                    "fourteens", "fifteens", "sixteens", "seventeens", "eighteens", "nineteens", "twenties",
                    "thirties", "forties", "fifties", "sixties", "seventies", "eighties", "nineties",
                    "hundreds", "thousands", "millions", "billions", "trillions", "halves", "quarters",
                    "thirds", "fourths", "fifths", "sixths", "sevenths", "eighths", "ninths", "tenths",
                    "elevenths", "twelfths", "thirteenths", "fourteenths", "fifteenths", "sixteenths",
                    "seventeenths", "eighteenths", "nineteenths", "twentieths", "thirtieths", "fortieths",
                    "fiftieths", "sixtieths", "seventieths", "eightieths", "ninetieths", "hundredths",
                    "thousandths", "millionths", "billionths", "trillionths", "zeroes", "nought", "naught", "nil", "fourty")
    # Words that are only sometimes part of a number. When we meet one, we look at the next
    # word: if that word is in either list, the current word is treated as a number word too.
    possible_number_words <- c("minus", "and", "point")
    phrase <- str_split(phrase, " ")[[1]]
    for (i in seq_along(phrase)) {
        good <- FALSE
        # strip punctuation and lower-case the word so that "One," matches "one"
        cleaned_word <- tolower(gsub("[[:punct:]]", "", phrase[i]))
        if (cleaned_word %in% possible_number_words) {
            # past the end of the vector, phrase[i + 1] is NA, which matches neither list
            next_cleaned_word <- tolower(gsub("[[:punct:]]", "", phrase[i + 1]))
            if (!(next_cleaned_word %in% number_words || next_cleaned_word %in% possible_number_words)) {
                good <- TRUE
            }
        } else if (!(cleaned_word %in% number_words)) {
            good <- TRUE
        }
        if (good) {
            out <- c(out, phrase[i])
        } else if (length(out) > 0 && endsWith(phrase[i], ".")) {
            # the removed word closed a sentence, so move its period onto the last kept word
            out[length(out)] <- paste0(out[length(out)], ".")
        }
    }
    return(paste(out, collapse = " "))
}

example <- "This is an example text which contains some numbers written in words such as one, two, three. The text one continues to text two and text three. I also have five hundred fifty dollars. But I am looking for five hundred thousands three hundred fourty seven more to invest into some stocks."

number_word_remover(example)
[1] "This is an example text which contains some numbers written in words such as. The text continues to text and text. I also have dollars. But I am looking for more to invest into some stocks."
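
Since the question is about a whole data frame column, a vectorised variant may be handy. The following is only a sketch of the same idea using base R regular expressions, not a replacement for the function above: `remove_number_words` is a hypothetical name, its word list is deliberately shorter (cardinals, tens, scale words and the "fourty" spelling from the question), and ordinals, fractions and hyphenated forms such as "twenty-one" are not handled. The `DF` built here is assumed to stand in for the asker's data frame.

```r
# Hypothetical helper (not from the answer above): remove spelled-out numbers
# from every element of a character vector using two regex passes.
remove_number_words <- function(x) {
    cardinals <- c("zero", "one", "two", "three", "four", "five", "six", "seven",
                   "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
                   "fifteen", "sixteen", "seventeen", "eighteen", "nineteen",
                   "twenty", "thirty", "forty", "fourty", "fifty", "sixty", "seventy",
                   "eighty", "ninety")
    scales <- c("hundred", "thousand", "million", "billion", "trillion")
    # alternation of all number words; scale words may carry a plural "s"
    alt <- paste(c(cardinals, paste0(scales, "s?")), collapse = "|")

    # pass 1: drop "and", "point", "minus" only when a number word follows
    connector <- paste0("\\s*\\b(?:and|point|minus)\\b(?=\\s+(?:", alt, ")\\b)")
    x <- gsub(connector, "", x, ignore.case = TRUE, perl = TRUE)

    # pass 2: drop each number word with its leading whitespace and a trailing comma;
    # a trailing period is left in place so sentences stay terminated
    number <- paste0("\\s*\\b(?:", alt, ")\\b,?")
    x <- gsub(number, "", x, ignore.case = TRUE, perl = TRUE)

    trimws(x)
}

# Assumed stand-in for the asker's data frame
DF <- data.frame(Title = c("The text one continues to text two and text three.",
                           "One hundred and nine dollars"),
                 stringsAsFactors = FALSE)
DF$CleanedTitle <- remove_number_words(DF$Title)
print(DF$CleanedTitle)
```

Because `gsub()` is vectorised, the whole column is cleaned in one call. The explicit loop in the function above gives finer control over punctuation, so which one fits better probably depends on how messy the real text is.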
Mark