0

Let me explain. My dflook like this:

id `  text                             c1      
1     Hello world how are you people    1 
2     Hello people I am fine  people    1
3     Good Morning people               -1
4     Good Evening                      -1

c1 contains only two values 1 or -1

Now I want a dataframe (output) like this:

Word      Totalcount     Points      PercentageOfPointAndTotalCount

hello        2             2              100
world        1             1              100
how          1             1              100
are          1             1              100
you          1             1              100
people       3             1              33.33
I            1             1              100
am           1             1              100
fine         1             1              100
Good         2             -2            -100
Morning      1             -1            -100
Evening      1             -1            -100

Here, Totalcount is the total times each word appears in text column.

points is the sum of c1 of each word. Example: people word is in two rows where c1 is 1, and one row where c1 is -1. So it's point is just 1 (2-1 = 1).

PercentageOfPointAndTotalCount = Points/TotalCount*100

print(df)

      id comment_text  target
0  59848  Hello world    -1.0
1  59849  Hello world    -1.0
john doe
  • 435
  • 1
  • 5
  • 12

2 Answers2

3

I am using unnesting after str.split, then we just need groupby + agg

unnesting(df,['text']).groupby('text').c1.agg(['count','sum'])
Out[873]: 
         count  sum
text               
Evening      1   -1
Good         2   -2
Hello        2    2
I            1    1
Morning      1   -1
am           1    1
are          1    1
fine         1    1
how          1    1
people       4    2
world        1    1
you          1    1

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, 1), how='left')
BENY
  • 317,841
  • 20
  • 164
  • 234
  • I am getting this error `ValueError: zero-dimensional arrays cannot be concatenated` on `pd.DataFrame` line – john doe Apr 18 '19 at 14:29
  • 1
    @johndoe you have empty line in text ? , just remove it :-) before unnesting , like `df=df[df.text!='']` – BENY Apr 18 '19 at 14:30
  • 1
    @johndoe your sample data is work fine, and you may need to find the different between the data you show to us with your real data – BENY Apr 18 '19 at 14:39
  • Thanks, let me find the difference. My real data contain millions of rows – john doe Apr 18 '19 at 14:40
  • I just called 2 rows,and still got the same error. I did print(df) and see the updated question to see the result – john doe Apr 18 '19 at 15:19
1

Here's a self-contained version:

new_df = (df.set_index('c1').text.str.split().apply(pd.Series)
      .stack().reset_index().drop('level_1', axis=1))

new_df.groupby(0).c1.agg(['sum','count'])

Output:

+---------+-----+-------+
|         | sum | count |
+---------+-----+-------+
|    0    |     |       |
+---------+-----+-------+
| Evening |  -1 |     1 |
| Good    |  -2 |     2 |
| Hello   |   2 |     2 |
| I       |   1 |     1 |
| Morning |  -1 |     1 |
| am      |   1 |     1 |
| are     |   1 |     1 |
| fine    |   1 |     1 |
| how     |   1 |     1 |
| people  |   2 |     4 |
| world   |   1 |     1 |
| you     |   1 |     1 |
+---------+-----+-------+
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74