
I have a dataframe `df` with the columns `Page`, `Word`, and `LineNum`.

df =
Idx Page Word LineNum
0 1 Hello 1
1 1 This 1
2 1 is 2
4 1 an 2
5 2 example 1
6 2 of 1
7 2 words 1
8 2 across 2
9 2 multiple 2
10 3 pages 1
11 3 in 1
12 3 the 1
13 4 document 1
14 4 which 1
15 4 has 1
16 4 split 1

This dataframe was loaded from a CSV file and holds one row per word of the document, together with the page and line on which the word appears.

As you can imagine, several words appear in the same line (have the same value in LineNum), and a single page has several such lines.

This is what I want to do:

for( all the pages in the dataframe)
    if(  LineNum is the same )
        df['AllWordsInLine'] = add all the words in the df['Word'] column.

Desired output

  1. LineDF['FullLine'] =
    Idx FullLine
    0 Hello This
    1 is an
    2 example of words
    3 across multiple
    4 pages in the
    5 document which has split

I am only about two weeks into pandas and would much appreciate an expert's response. Thank you, Venkat

Venkatesh
  • `groupby` should work, show what you tried. – Barmar Mar 30 '18 at 16:05
  • Hello Barmar, this is what I tried: df['AllWordsInLine'] = df.groupby('Page')['Line']. I know this is not correct, but am not getting the right syntax. – Venkatesh Mar 30 '18 at 16:10
  • Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your post correspondingly. – MaxU - stand with Ukraine Mar 30 '18 at 17:07

2 Answers

import pandas as pd

# Small sample frame with the same columns as in the question
df = pd.DataFrame({'Page':[0,0,0,1,1,1,2],
               'Word':['a','b','c','d','e','f','g'],
               'LineNum':[0,0,1,0,1,2,0]})

# Grouping on both columns yields one sub-frame per (Page, LineNum) pair
for (page, line_num), subdf in df.groupby(['Page','LineNum']):
    print('Page:',page,', Line:',line_num,', All words in line:',
      subdf.Word.values)

# Page: 0 , Line: 0 , All words in line: ['a' 'b']
# Page: 0 , Line: 1 , All words in line: ['c']
# Page: 1 , Line: 0 , All words in line: ['d']
# Page: 1 , Line: 1 , All words in line: ['e']
# Page: 1 , Line: 2 , All words in line: ['f']
# Page: 2 , Line: 0 , All words in line: ['g']
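
The loop above only prints each group. To get one joined string per line, as in the question's desired output, the same grouping can be combined with `' '.join`. This is a minimal sketch, not necessarily the only way to do it; the sample rows are a shortened copy of the question's data, and the names `line_df` and `FullLine` are chosen here for illustration.

```python
import pandas as pd

# Illustrative sample: the first two pages of the question's data
df = pd.DataFrame({
    'Page':    [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'Word':    ['Hello', 'This', 'is', 'an',
                'example', 'of', 'words', 'across', 'multiple'],
    'LineNum': [1, 1, 2, 2, 1, 1, 1, 2, 2],
})

# Group by page and line; sort=False keeps groups in order of first appearance.
# ' '.join concatenates the words of each group with single spaces.
line_df = (
    df.groupby(['Page', 'LineNum'], sort=False)['Word']
      .agg(' '.join)
      .reset_index(name='FullLine')
)

print(line_df['FullLine'].tolist())
# ['Hello This', 'is an', 'example of words', 'across multiple']
```

Grouping on both `Page` and `LineNum` matters because line numbers restart on every page; grouping on `LineNum` alone would merge line 1 of every page into one string.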
n3utrino

I assume you want all the words across pages for each line number. In other words, you want a mapping from line number to set of words.

You can achieve this by grouping by `LineNum` and aggregating each group to a set. Here is a minimal example:

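import pandas as pd

df = pd.DataFrame({'Page':[0,0,0,1,1,1,2],
                   'Word':['a','b','a','d','e','d','g'],
                   'LineNum':[0,0,1,0,1,2,0]})

# One set of unique words per line number (element order within a set is arbitrary)
res = df.groupby('LineNum')['Word'].apply(set)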

# LineNum
# 0    {b, g, a, d}
# 1          {a, e}
# 2             {d}
# Name: Word, dtype: object
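
A set discards both duplicates and word order. If the order of first appearance matters, one possible variant (a sketch under that assumption, not part of the original answer) is to de-duplicate with `dict.fromkeys`, which preserves insertion order, and join the result. The name `ordered` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Page':[0,0,0,1,1,1,2],
                   'Word':['a','b','a','d','e','d','g'],
                   'LineNum':[0,0,1,0,1,2,0]})

# dict.fromkeys drops repeated words but keeps the first-seen order,
# then the remaining words are joined into one string per line number.
ordered = df.groupby('LineNum')['Word'].apply(
    lambda words: ' '.join(dict.fromkeys(words))
)

print(ordered.tolist())
# ['a b d g', 'a e', 'd']
```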
jpp