I have a dataframe that consists of two columns: ID and TEXT. Pretend data is below:

ID    TEXT
1     The farmer plants grain. The fisher catches tuna.
2     The sky is blue.
2     The sun is bright.
3     I own a phone. I own a book.

I am performing cleansing on the TEXT column with nltk, so I need to convert the TEXT column to a list:

corpus = df['TEXT'].tolist()

After performing the cleansing (tokenization, removing special characters, and removing stopwords), the output is a "list of lists" and looks like this:

[[['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']],
[['sky', 'blue']],
[['sun', 'bright']],
[['I', 'own', 'phone'], ['I', 'own', 'book']]]
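For reference, the cleansing step that produces this nested structure can be sketched roughly as follows. This is a self-contained stand-in, not my exact code: it uses a naive period split and a tiny hard-coded stopword set in place of nltk's `sent_tokenize` and `stopwords`, so it runs without any nltk data files:

```python
import re

# Stand-in stopword list (nltk's stopwords corpus would be used in practice).
STOPWORDS = {'the', 'is', 'a'}

def clean(text):
    """Split text into sentences, then into word lists with
    punctuation and stopwords removed."""
    sentences = [s for s in re.split(r'\.\s*', text) if s]
    cleaned = []
    for sent in sentences:
        words = [w for w in re.findall(r'\w+', sent)
                 if w.lower() not in STOPWORDS]
        cleaned.append(words)
    return cleaned

corpus = [
    'The farmer plants grain. The fisher catches tuna.',
    'The sky is blue.',
    'The sun is bright.',
    'I own a phone. I own a book.',
]
nested = [clean(text) for text in corpus]
# nested[0] == [['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']]
```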

I know how to get a list back into a pandas dataframe, but how do I get the list of lists back into a pandas dataframe with the ID column still assigned to the text? My desired output is:

ID    TEXT
1     'farmer', 'plants', 'grain'
1     'fisher', 'catches', 'tuna'
2     'sky', 'blue'
2     'sun', 'bright'
3     'I', 'own', 'phone'
3     'I', 'own', 'book'

I'm assuming it is something simple related to conversion between Python data structures, but I'm not sure where to start with this. The specific work product here is less important than the concept of dataframe --> native Python data structure --> do something to native Python data structure --> dataframe with original attributes intact.

Any insight you all can provide is greatly appreciated! Please let me know if I can improve my question at all!

OverflowingTheGlass
  • Right, that's the crux of my issue. Regardless of what's done to a list, how do I maintain the attributes of a dataframe no matter where the data goes? I thought of concatenating the `ID` and `TEXT` fields, then stripping the `ID` back out later. However, that wouldn't work for sentence tokenizations because the `ID` would only be assigned to the first sentence. – OverflowingTheGlass May 11 '17 at 13:50
  • Hi, Cameron, I was wondering what method/function you are using to get the `list of list` output. – Moondra May 11 '17 at 16:50
  • It results from taking a normal list, and then sentence tokenizing the list with nltk. This results in a list of lists because each member of a normal list may have multiple sentences. – OverflowingTheGlass May 11 '17 at 16:52

1 Answer


Pandas dataframes offer a lot of quick across-the-board operations, but it is indeed much easier to get your hands on your data if it's not stuffed in a dataframe, especially if you're just getting started. I certainly recommend this approach if you'll be working with nltk. To keep the text and IDs together, convert your dataframe into a list of tuples. If your dataframe really has only two meaningful columns, you can do it like this:

>>> data = list(zip(df["ID"], df["TEXT"]))
>>> from pprint import pprint
>>> pprint(data)
[(265, 'The farmer plants grain. The fisher catches tuna.'),
 (456, 'The sky is blue.'),
 (434, 'The sun is bright.'),
 (921, 'I own a phone. I own a book.')]

Now if you want to work with your sentences without losing the IDs, use a two-variable loop like this (this creates the extra rows you were asking for):

sent_data = []
for id, text in data:
    for sent in nltk.sent_tokenize(text):
        sent_data.append((id, sent))
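To show what that loop produces, here is a self-contained demo, with a naive period split standing in for `nltk.sent_tokenize` so it runs without nltk's data files (the real tokenizer is more robust, but the loop structure is identical):

```python
# Naive stand-in for nltk.sent_tokenize: split on '. ' and restore the period.
def sent_tokenize(text):
    return [s.strip('.') + '.' for s in text.split('. ') if s.strip('.')]

data = [
    (1, 'The farmer plants grain. The fisher catches tuna.'),
    (2, 'The sky is blue.'),
    (3, 'I own a phone. I own a book.'),
]

# The two-variable loop: each ID is repeated once per sentence.
sent_data = []
for id_, text in data:
    for sent in sent_tokenize(text):
        sent_data.append((id_, sent))

# sent_data[0] == (1, 'The farmer plants grain.')
# sent_data[1] == (1, 'The fisher catches tuna.')
```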

What you do depends on your application; you'll probably create a new list of two-element tuples. If you're just applying a transformation, use a list comprehension. For example:

>>> datawords = [ (id, nltk.word_tokenize(t)) for id, t in data ]
>>> print(datawords[3])
(921, ['I', 'own', 'a', 'phone', '.', 'I', 'own', 'a', 'book', '.'])

Turning a list of tuples back into a dataframe is as simple as it gets:

newdf = pd.DataFrame(datawords, columns=["INDEX", "WORDS"])
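Putting the pieces together: if you feed the per-sentence pairs (the output of the `sent_data` loop, after word tokenization) into the same `pd.DataFrame` call, you get exactly the one-row-per-sentence layout from the question. The token lists below are written out by hand for illustration:

```python
import pandas as pd

# Per-sentence (id, tokens) pairs, as the sent_data loop plus word
# tokenization would produce them.
sent_words = [
    (1, ['farmer', 'plants', 'grain']),
    (1, ['fisher', 'catches', 'tuna']),
    (2, ['sky', 'blue']),
    (2, ['sun', 'bright']),
    (3, ['I', 'own', 'phone']),
    (3, ['I', 'own', 'book']),
]

# One row per sentence, with the original ID repeated as needed.
newdf = pd.DataFrame(sent_words, columns=['ID', 'TEXT'])
```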
alexis
  • This seems to work for simple transformations, like you pointed out. I tried it with BeautifulSoup to get text from html tags that happen to be in the TEXT field, and it balked at that (Error: expected string or bytes-like object). Anyway, thank you! And I assume converting from the tuples back to a DF is not that difficult to do. – OverflowingTheGlass May 11 '17 at 20:14
  • You should stop jumping from topic to topic! What do your BeautifulSoup type errors have to do with turning a dataframe into a list of pairs? Solve one problem at a time, ok? Anyway I added the reverse conversion, which is indeed trivial. – alexis May 11 '17 at 20:24
  • Well it was the logical next step! The first transformation I wanted to perform on the tuples was to remove the html tags that are present in some records. I'm just excited! Thank you - I really do appreciate it. – OverflowingTheGlass May 11 '17 at 20:27
  • Have fun with it. If you get stuck with cleaning your data, google it then ask a NEW question (with all irrelevant context removed). – alexis May 11 '17 at 20:29
  • Unfortunately, when you convert back to the DF as you describe, the output is not the desired output (as described in the original question). Instead of separate lines for each sentence, the sentences are still forced onto one line. This makes me worry that the same will be true when I get to the stage of actually creating nGrams as well. – OverflowingTheGlass May 11 '17 at 20:36
  • I didn't answer your other question, I answered this question. Make yourself some test data with multiple rows per ID, and you'll see that the conversion respects it. I've no idea where you want to go with ngrams so I can't speak about that. – alexis May 11 '17 at 20:40
  • If you want a separate row per sentence (which I don't see why you need, tbh), just write a loop with `append()` in place of my model loop. – alexis May 11 '17 at 20:42
  • By original question, I meant this question that we are on. I have data flowing through and the conversion does not respect it. So this tuple `(921, 'I own a phone. I own a book.')`, after undergoing `sent_tokenize` and converted back to a DF, is simply `921 I own a phone. I own a book`. – OverflowingTheGlass May 11 '17 at 20:44
  • Got it. Sorry, I'm very, very new to Python. The reason I am thinking a separate row is necessary, is because once the periods are stripped out, nGrams could be found that span two sentences (last word and first word), which would not be sound. Thanks for the help. – OverflowingTheGlass May 11 '17 at 20:46
  • If you have a moment, could you please add the append statement? I tried `for i, t, in data: sent = nltk.sent_tokenize(t) data.append(sent)` – OverflowingTheGlass May 11 '17 at 20:55
  • There you go. Obviously you'll use `sent_data` for further processing, not `data`. – alexis May 11 '17 at 21:18
  • You're right about ngrams not crossing sentence boundaries, but you can do all this on the go (`sent_tokenize, word_tokenize, ngrams`) and just save the results in the same single row. You do want the ngrams from all sentences in a message (or product) to be added together. Anyway do it your way, you'll figure it out. – alexis May 11 '17 at 21:20
  • Thank you, again. I will give it a shot. I don't quite conceptually understand your point about putting the results on the same row, but I will save that for a future question. – OverflowingTheGlass May 11 '17 at 21:24