
I've been doing NLP work that involves tokenizing texts into n-grams. I have to count the occurrences of each n-gram, by label A and B respectively.

However, I have to choose between putting a long list into a single column and building a very long dataframe, and I'm not sure which structure is superior.

AFAIK, having lists inside a column of a dataframe is bad structure, since you can hardly get any useful information out with pandas operations, like getting the frequency (occurrence count) of an item that is spread across several lists. Even where such tasks are possible, they require extra computation.

However, I also know that a very long dataframe eats up a lot of RAM, and could even kill other processes if the data gets too big to fit in memory. That's a situation I certainly want to avoid.
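(To get a feel for the footprint of each candidate, a quick sketch using pandas' built-in memory report; the toy frame here is just for illustration:)

import pandas as pd

df = pd.DataFrame({'NGRAM': [['hey', 'hey', 'reddit'], ['python']]})  # any candidate frame
print(df.memory_usage(deep=True))        # bytes per column, including Python objects
print(df.memory_usage(deep=True).sum())  # total bytes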

So now I have to make a choice. What I want to do is count each n-gram's occurrences by its label.

For example (the dataframes are shown below), I want to end up with counts like:

{ngram: hey, occurrence_A: 2, occurrence_B: 0}
{ngram: python, occurrence_A: 2, occurrence_B: 1}
...

I think it's relevant to state my computer's specs:

CPU: i3-6100

RAM: 16GB

GPU: n/a

DataFrame 1:

+------------+-------------------------------------------+-------+
|    DATE    |                   NGRAM                   | LABEL |
+------------+-------------------------------------------+-------+
| 2019-02-01 | [hey, hey, reddit, reddit, learn, python] | A     |
| 2019-02-02 | [python, reddit, pandas, dataframe]       | B     |
| 2019-02-03 | [python, reddit, ask, learn]              | A     |
+------------+-------------------------------------------+-------+

DataFrame 2:

+------------+-----------+-------+
|    DATE    |   NGRAM   | LABEL |
+------------+-----------+-------+
| 2019-02-01 | hey       | A     |
| 2019-02-01 | hey       | A     |
| 2019-02-01 | reddit    | A     |
| 2019-02-01 | reddit    | A     |
| 2019-02-01 | learn     | A     |
| 2019-02-01 | python    | A     |
| 2019-02-02 | python    | B     |
| 2019-02-02 | reddit    | B     |
| 2019-02-02 | pandas    | B     |
| 2019-02-02 | dataframe | B     |
| 2019-02-03 | python    | A     |
| 2019-02-03 | reddit    | A     |
| 2019-02-03 | ask       | A     |
| 2019-02-03 | learn     | A     |
+------------+-----------+-------+
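(For reference, a minimal sketch that builds both frames; pandas >= 0.25 is assumed for explode, and df1/df2 follow the table names above:)

import pandas as pd

df1 = pd.DataFrame({
    'DATE': ['2019-02-01', '2019-02-02', '2019-02-03'],
    'NGRAM': [['hey', 'hey', 'reddit', 'reddit', 'learn', 'python'],
              ['python', 'reddit', 'pandas', 'dataframe'],
              ['python', 'reddit', 'ask', 'learn']],
    'LABEL': ['A', 'B', 'A'],
})

# DataFrame 2 is DataFrame 1 with the list column exploded into one row per n-gram
df2 = df1.explode('NGRAM').reset_index(drop=True)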
  • @jezrael I don't know if you can see this but sir, this is NOT a duplicate question. The post you linked explains HOW to explode a dataframe. What I'm asking is WHICH structure is better and WHY. Totally different. – user8491363 Feb 25 '20 at 07:40
  • So sorry, reopened. – jezrael Feb 25 '20 at 07:41
  • 1
    Btw, my opinion is second one, because [this](https://stackoverflow.com/a/52563718/2901002). But maybe depends of real data. – jezrael Feb 25 '20 at 07:44
  • 1
    It really depends on what you need to do with it after this step, how you access data etc..., in my oppinion, goes with 2nd option will fit most the case – Phung Duy Phong Feb 25 '20 at 08:17
  • Alright, going for the second, and if it breaks, I'll go for the first. At least now I know that the second one is ideal. – user8491363 Feb 25 '20 at 08:19
  • Second is better as it gives you more flexibility for doing any I/O. Having a list inside may pose some unintended issues. – Mahendra Singh Feb 25 '20 at 08:24

1 Answer


Like you mentioned, having a list inside a column of a dataframe is bad structure, and a long-format dataframe is preferred. Let me attempt to answer the question from several aspects:

  1. Added complexity for data manipulation & lack of native support functions for list-like columns

With a list-like column, you are not able to use Pandas functions readily.

For example, you mentioned you are interested in the NGRAM counts by LABEL. With the long-format dataframe (df2), you can get what you need with a simple groupby and count, while for the list-column dataframe (df1) you need to explode the list column before you can work on it:

df2.groupby(['LABEL','NGRAM']).count().unstack(-1).fillna(0)

df1.explode(column='NGRAM').groupby(['LABEL','NGRAM']).count().unstack(-1).fillna(0)

Both give you the same thing:

+-------+-----+-----------+-----+-------+--------+--------+--------+
| LABEL | ask | dataframe | hey | learn | pandas | python | reddit |
+-------+-----+-----------+-----+-------+--------+--------+--------+
| A     | 1.0 |       0.0 | 2.0 |   2.0 |    0.0 |    2.0 |    3.0 |
| B     | 0.0 |       1.0 | 0.0 |   0.0 |    1.0 |    1.0 |    1.0 |
+-------+-----+-----------+-----+-------+--------+--------+--------+

In addition, many native Pandas functions (e.g. my favourite value_counts) can't work on lists directly, so explode is almost always needed.
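A quick illustration with the frames above (df1 holds the lists, df2 is the long format):

df2['NGRAM'].value_counts()             # works directly on the long format
df1['NGRAM'].explode().value_counts()   # the list column must be exploded first
# df1['NGRAM'].value_counts() raises TypeError: unhashable type: 'list'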

  2. Lower computation time for long data than for list-like data (generally speaking, since we don't need to explode the column first)

Imagine you decided to capitalize your NGRAM values. You would do the following for each structure, and you can see that the list-column dataframe (df1) takes much longer to execute:

df2['NGRAM'] = df2['NGRAM'].str.capitalize()
# 1000 loops, best of 5: 246 µs per loop

df1['NGRAM'] = df1['NGRAM'].explode().str.capitalize().groupby(level=0).apply(list)
# 1000 loops, best of 5: 1.49 ms per loop

If memory is an issue for you, you might want to consider working with the NGRAM counts per label directly (the data structure in the table above, rather than storing them as either df1 or df2), or using NumPy arrays (which reduce the overhead of Pandas slightly) while keeping an NGRAM dictionary file separately.
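A rough sketch of one way that NumPy route could look; the integer-id encoding and np.bincount counting here are my own illustration of the idea:

import numpy as np

vocab = {}  # n-gram -> integer id; this mapping is what you'd persist as the dictionary file

def encode(ngram):
    # assign the next free id on first sight, reuse it afterwards
    return vocab.setdefault(ngram, len(vocab))

ids = np.array([encode(g) for g in ['hey', 'hey', 'reddit', 'python']], dtype=np.int64)
counts = np.bincount(ids, minlength=len(vocab))  # counts[i] = occurrences of id i
# e.g. counts[vocab['hey']] == 2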

Toukenize
  • Thanks for the great answer! On a side note, what do you think about 'super long, comma-separated string' vs 'super long dataframe'? Because my data has a LOT of n-grams packed in one bag (such as a list), following DF2's structure would inevitably produce a substantially larger dataframe. – user8491363 Feb 25 '20 at 10:39
  • I like how you made a super clear dataframe with label counts, and actually, that's exactly what I'm trying to get in the end. However, because the raw text data is dispersed across a lot of text files and needs many NLP pre-processing steps, I'm still gonna have to deal with this structure problem first. – user8491363 Feb 25 '20 at 10:41
  • Personally, I would prefer to work with super long, comma-separated strings stored as .txt files, reading and processing every line of the individual text files before appending them to the dataframe. (Pandas dataframes just make things easier to explore, but might not be the most memory-efficient.) You can explore `collections.Counter`, a specialized built-in type in Python meant to store counts of items, and thus suitable for your n-gram storage. – Toukenize Feb 25 '20 at 15:44
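(A sketch of what that last comment describes: streaming one comma-separated line of n-grams at a time into a `collections.Counter` per label, so the full long frame never has to exist. The file name is hypothetical:)

from collections import Counter

counts_A = Counter()
with open('label_A_ngrams.txt', encoding='utf-8') as f:
    for line in f:                               # one document's n-grams per line
        counts_A.update(line.rstrip('\n').split(','))

print(counts_A.most_common(10))                  # top n-grams for label A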