
I have a large Excel file like the following:

Timestamp       Text                                  Work        Id
5/4/16 17:52    rain a lot the packs maybe damage.    Delivery    XYZ
5/4/16 18:29    wh. screen                            Other       ABC
5/4/16 14:54    15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner.  he needs a call back from customer support    Delivery    YYY
5/6/16 13:05    How will I know if I                  Signing up  ASX
5/4/16 23:07    an quality                            Delivery    DFC

I want to work only on the "Text" column and eliminate the rows whose "Text" is basically just gibberish (rows 2, 4, and 5 in the example above).

I'm reading the first two columns as follows:

import xlrd
book = xlrd.open_workbook("excel.xlsx")
sheet = book.sheet_by_index(0)
for row_index in xrange(1, sheet.nrows): # skip heading row
    timestamp, text = sheet.row_values(row_index, end_colx=2)
    print(text)

How do I remove the gibberish rows? My idea is that I need to work with nltk and have a positive corpus (one that does not have any gibberish), a negative corpus (only gibberish text), and train a model with them. But how do I go about implementing it? Please help!

Arman

2 Answers


You can use nltk to do the following.

>>> import nltk
>>> nltk.download('words')  # one-time download of the word list
>>> english_words = set(w.lower() for w in nltk.corpus.words.words())
>>> 'a' in english_words
True
>>> 'dog' in english_words
True
>>> 'asdasdase' in english_words
False

How to get the individual words from a string with nltk:

>>> nltk.download('punkt')  # one-time download of the tokenizer models
>>> individual_words_from_string = nltk.word_tokenize('This is my text from text column')
>>> individual_words_from_string
['This', 'is', 'my', 'text', 'from', 'text', 'column']

For each row's "Text" value, test the individual words to see whether they are in the English dictionary. If they all are, you know that the row's text is not gibberish.
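
Put together, a minimal sketch of that per-row test could look like this (the helper name all_words_english is my own invention; punctuation tokens are skipped so a trailing '.' does not count against a row):

def all_words_english(text):
    # Keep only alphabetic tokens; punctuation like '.' is never in the word list.
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    return bool(tokens) and all(t in english_words for t in tokens)

You could call this on each text value inside the xlrd loop from the question and skip the rows where it returns False.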

If your definition of gibberish vs. non-gibberish is different from "English words found in nltk", you can use the same process as above, just with a different list of acceptable words.

How to accept numbers and street addresses?

A simple way to determine if something is a number:

>>> word = '32423432'
>>> word.isdigit()
True
>>> word = '32423432ds'
>>> word.isdigit()
False

Addresses are more difficult. You can find info on that here: Parsing Addresses, and probably many other places. Of course, you can always use the above logic if you have access to a list of cities, states, roads, etc.
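
As a rough sketch of that whitelist idea (the location_words set below is a made-up stand-in; in practice you would load real city, state, and street names from a gazetteer):

# Hypothetical whitelist; replace it with a real list of cities, states, roads, etc.
location_words = {"pflugerville", "tx", "lane"}

def is_acceptable(token):
    token = token.lower()
    return token in english_words or token in location_words or token.isdigit()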

Will it fail if any one word is False?

It's your code, so you decide. Perhaps you could mark something as gibberish if more than x% of the words in the text are not recognized?
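
A sketch of that percentage idea, reusing the is_acceptable helper above (the 50% default threshold is an arbitrary assumption you would tune on your data):

def is_gibberish(text, threshold=0.5):
    # Ignore punctuation-only tokens so '.' and ',' do not skew the ratio.
    tokens = [t for t in nltk.word_tokenize(text) if t.isalnum()]
    if not tokens:
        return True
    unknown = sum(1 for t in tokens if not is_acceptable(t))
    return float(unknown) / len(tokens) > threshold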

How to determine if grammar is correct?

This is a bigger topic, and a more in-depth explanation can be found at the following link: Checking Grammar. But the above answer will just check if words are in the nltk corpus, not whether or not the sentence is grammatically correct.

user2263572
  • Great, thanks! But how will I make it accept numbers and street addresses? Also, will it return false if the entire sentence falls under English words except, say, one or two words? – Arman May 05 '17 at 17:53
  • Also, from the sample data (in the last row), "an quality" falls into English words, but for us it is gibberish because it doesn't make sense. With your suggested english_words, I think it will return true. – Arman May 05 '17 at 17:55
  • Good questions, and you are correct. Determining if all sentences are syntactically correct is very different from determining if words are real. I'll update the answer to provide more info. – user2263572 May 05 '17 at 17:57
  • Answer updated to reflect the additional questions in the comments. – user2263572 May 05 '17 at 18:13

Separating good text from 'gibber' is not a trivial task, especially if you are dealing with text messages / chats (that's what it looks like to me).

A misspelled word does not make a sample unusable, and even a syntactically wrong sentence should not disqualify the whole text. That's a standard you could apply to newspaper texts, but not to raw, user-generated content.

I would annotate a corpus in which you separate the good samples from the bad ones, and train a simple classifier on it. Annotation does not have to be a big effort, since the gibberish texts are shorter than the good ones and should be easy to recognise (at least some of them). Also, you could start with a corpus of ~100 datapoints (50 good / 50 bad) and expand it once the first model is more or less working.

This is sample code that I always use for text classification. You need to install scikit-learn and numpy, though:

import re
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Prepare data

def prepare_data(data):
    """
    data is expected to be a list of tuples of category and texts.
    Returns the labels and the texts as two parallel tuples.
    """
    random.shuffle(data)
    return zip(*data)

# Format training data

training_data = [
    ("good", "rain a lot the packs maybe damage."),
    ("good", "15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner.  he needs a call back from customer support "),
    ("gibber", "wh. screen"),
    ("gibber", "How will I know if I")
]
training_labels, training_texts = prepare_data(training_data)


# Format test set
test_data = [
    ("gibber", "an quality"),
    ("good", "<datapoint with valid text>",
    # ...
]
test_labels, test_texts = prepare_data(test_data)


# Create feature vectors

"""
Convert a collection of text documents to a matrix of token counts.
See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
"""
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(training_texts)
y = training_labels


# Train the classifier

clf = LogisticRegression()
clf.fit(X, y)


# Test performance

X_test = vectorizer.transform(test_texts)
y_test = test_labels

# Generates a list of labels corresponding to the samples
test_predictions = clf.predict(X_test)

# Convert back to the usual format
annotated_test_data = list(zip(test_predictions, test_texts))

# evaluate predictions
y_test = np.array(test_labels)
print(metrics.classification_report(y_test, test_predictions))
print("Accuracy: %0.4f" % metrics.accuracy_score(y_test, test_predictions))

# predict labels for unknown texts
data = ["text1", "text2",]
# Important: use the same vectorizer you used for the training.
# When saving the model (e.g. via pickle) always serialize
# classifier & vectorizer
X = vectorizer.transform(data)
# Now predict the labels for the texts in 'data'
labels = clf.predict(X)
# And put them back together 
result = list(zip(labels, data))
# result = [("good", "text1"), ("gibber", "text2")]

A few words about how it works: the count vectorizer tokenizes the text and creates vectors containing the counts of all words in the corpus. Based upon these vectors, the classifier tries to recognise patterns that distinguish between the two categories. A text with only a few, uncommon (because misspelled) words is more likely to land in the 'gibber' category, while a text with many words that are typical of common sentences (think of all the stop words here: 'I', 'you', 'is'...) is more likely to be good text.
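
To make that concrete, here is a tiny standalone demo of what the vectorizer produces (the two example texts are taken from the question's data):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["rain a lot the packs maybe damage", "wh. screen"]
vec = CountVectorizer()
counts = vec.fit_transform(docs)
print(vec.vocabulary_)    # maps each token to a column index
print(counts.toarray())   # one row of token counts per document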

If this method works for you, you should also try other classifiers and use the first model to semi-automatically annotate a larger training corpus.
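
For example, a Naive Bayes or a linear SVM classifier can often be swapped in with minimal changes; a sketch reusing X and y from the training code above:

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

clf = MultinomialNB()   # or: clf = LinearSVC()
clf.fit(X, y)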

Johannes Gontrum
  • "Annotation does not have to be a big effort, since these gibberish texts are shorter than the good ones and should be easy to recognise". Above example "rain a lot the packs maybe damage" valid text, "How will I know if I" invalid text. Even in this small sample, distinguishing between those two wouldn't be trivial. A supervised learning algorithm based off of labelling invalids as fields with mostly "short text" would be highly prone to overfitting. I agree with your overall approach, but I think the amount of effort involved with pulling it off would be quite substantial/difficult. – user2263572 May 05 '17 at 18:48
  • Trying to understand this piece of code: test_data = [ #TODO... So what do I need to do here? Will I be manually adding labels to the test data? – Arman May 05 '17 at 18:54
  • @Arman Add datapoints in the same format as in the training set. The performance of the classifier will be evaluated on these examples, so it is important that they are not duplicates of the training data. – Johannes Gontrum May 05 '17 at 22:00
  • @user2263572 I agree that there is a good chance of overfitting in this case. However, when I encounter a problem like this, I annotate a small corpus (in this case it should not take long) to see if a classifier can make any sense out of the data. If it fails, it's time to find a more sophisticated approach. – Johannes Gontrum May 05 '17 at 22:04
  • Hmm, I understand that. But shouldn't the whole purpose of this be that it should be able to accurately predict which texts are good and which are bad? If I'll be putting labels myself in the test text, then how will the model tell me good vs bad text? – Arman May 05 '17 at 22:15
  • @Arman I'm assuming that you have a large dataset - maybe thousands or tens of thousands of messages. If you annotate a small subset (100-200 should be enough for this task), the classifier can try to learn to separate the texts based on those samples. The test set's purpose is only to have information about the accuracy of the classifier. I'm adding a bit of code to demonstrate how you can then predict the labels of your other texts. – Johannes Gontrum May 06 '17 at 08:10
  • Hi, could you suggest other classifiers that I should try? Also, what do you mean by "use the first model to semi-automatically annotate a larger training corpus"? What is the first model here? – Arman May 10 '17 at 17:31