I've been trying to do NLP by tokenizing texts into n-grams. I have to count how many occurrences of each n-gram there are, for labels A and B respectively.
However, I have to choose between putting a long list into a single column versus building a very long dataframe, and I'm not sure which structure is better.
AFAIK, keeping lists inside a dataframe column is bad structure, since you can hardly get any useful information out of it with pandas operations, like the frequency (occurrence count) of an item that's spread across several lists. And even where it is possible, it takes extra computation.
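To illustrate (a rough sketch, assuming df holds lists like DataFrame 1 below), counting items across lists means building and merging per-row Counters by hand instead of using a vectorized pandas operation:

    from collections import Counter

    # with lists inside the NGRAM column, each row's list has to be
    # counted and merged manually -- no direct value_counts() here
    total = Counter()
    for ngrams in df["NGRAM"]:
        total.update(ngrams)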
On the other hand, I know that a very long dataframe eats up a lot of RAM, and could even kill other processes if the data gets too big to fit in memory. That's a situation I definitely want to avoid.
So now I have to make a choice. What I want to do is count each n-gram's occurrences by label.
For example (the source dataframes are shown below):
[
  {ngram: hey, occurrence_A: 2, occurrence_B: 0},
  {ngram: python, occurrence_A: 2, occurrence_B: 1},
  ...
]
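If it helps, here's a rough sketch of how I imagine producing that table from the long format (DataFrame 2 below) with pd.crosstab; occurrence_A/occurrence_B are just column names I made up:

    import pandas as pd

    # one row per n-gram, one column per label, cells hold occurrence counts
    counts = pd.crosstab(df2["NGRAM"], df2["LABEL"])
    counts = counts.rename(columns={"A": "occurrence_A", "B": "occurrence_B"})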
I think it's relevant to state my computer's specs:
CPU: i3-6100
RAM: 16GB
GPU: n/a
DataFrame 1:
+------------+-------------------------------------------+-------+
| DATE | NGRAM | LABEL |
+------------+-------------------------------------------+-------+
| 2019-02-01 | [hey, hey, reddit, reddit, learn, python] | A |
| 2019-02-02 | [python, reddit, pandas, dataframe] | B |
| 2019-02-03 | [python, reddit, ask, learn] | A |
+------------+-------------------------------------------+-------+
DataFrame 2:
+------------+-----------+-------+
| DATE | NGRAM | LABEL |
+------------+-----------+-------+
| 2019-02-01 | hey | A |
| 2019-02-01 | hey | A |
| 2019-02-01 | reddit | A |
| 2019-02-01 | reddit | A |
| 2019-02-01 | learn | A |
| 2019-02-01 | python | A |
| 2019-02-02 | python | B |
| 2019-02-02 | reddit | B |
| 2019-02-02 | pandas | B |
| 2019-02-02 | dataframe | B |
| 2019-02-03 | python | A |
| 2019-02-03 | reddit | A |
| 2019-02-03 | ask | A |
| 2019-02-03 | learn | A |
+------------+-----------+-------+
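For reference, this sketch is how I'd expect to move between the two layouts and blunt the RAM cost of the long frame; explode needs pandas >= 0.25, and the category trick is an assumption on my side, not something I've benchmarked:

    # DataFrame 1 -> DataFrame 2: one row per list element
    df2 = df1.explode("NGRAM").reset_index(drop=True)

    # store repeated strings as integer codes to save memory
    df2["NGRAM"] = df2["NGRAM"].astype("category")
    df2["LABEL"] = df2["LABEL"].astype("category")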