Finding all words where the line number is the same in a pandas dataframe

Question

I have a dataframe df with the columns 'Page', 'Word' and 'LineNum'.

df =
Idx  Page  Word      LineNum
0    1     Hello     1
1    1     This      1
2    1     is        2
4    1     an        2
5    2     example   1
6    2     of        1
7    2     words     1
8    2     across    2
9    2     multiple  2
10   3     pages     1
11   3     in        1
12   3     the       1
13   4     document  1
14   4     which     1
15   4     has       1
16   4     split     1

This dataframe was read from a CSV file and holds one row per word of the document.

As you can imagine, several words appear in the same line (have the same value in LineNum), and a single page has several such lines.

This is what I want to do:

for( all the pages in the dataframe)
    if(  LineNum is the same )
        df['AllWordsInLine'] = add all the words in the df['Word'] column.

Desired output

LineDF['FullLine'] =
Idx  FullLine
0    Hello This
1    is an
2    example of words
3    across multiple
4    pages in the
5    document which has split

I am only about 2 weeks into pandas, and I would much appreciate an expert's response. Thank you, Venkat

Hello Barmar, this is what I tried: df['AllWordsInLine'] = df.groupby('Page')['Line']. I know this is not correct, but am not getting the right syntax. — Venkatesh, Mar 30 '18 at 16:10
Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your post correspondingly. — MaxU - stand with Ukraine, Mar 30 '18 at 17:07

n3utrino · Answer 1 · 2018-03-30T17:26:32.470

import pandas as pd

df = pd.DataFrame({'Page':[0,0,0,1,1,1,2],
                   'Word':['a','b','c','d','e','f','g'],
                   'LineNum':[0,0,1,0,1,2,0]})

# Group on both columns so each (page, line) pair is handled separately
for (page, line), subdf in df.groupby(['Page','LineNum']):
    print('Page:', page, ', Line:', line, ', All words in line:',
          subdf.Word.values)

# Page: 0 , Line: 0 , All words in line: ['a' 'b']
# Page: 0 , Line: 1 , All words in line: ['c']
# Page: 1 , Line: 0 , All words in line: ['d']
# Page: 1 , Line: 1 , All words in line: ['e']
# Page: 1 , Line: 2 , All words in line: ['f']
# Page: 2 , Line: 0 , All words in line: ['g']
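
If the goal is to keep one row per word and attach the joined line to each row, as the df['AllWordsInLine'] line in the question's pseudocode suggests, a groupby followed by transform might be enough. This is only a sketch: it assumes the rows are already in reading order, and it reuses the toy frame from above rather than the real data.

```python
import pandas as pd

df = pd.DataFrame({'Page':    [0, 0, 0, 1, 1, 1, 2],
                   'Word':    ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                   'LineNum': [0, 0, 1, 0, 1, 2, 0]})

# transform returns a result aligned with the original rows, so every word
# gets the full text of the (page, line) group it belongs to.
df['AllWordsInLine'] = df.groupby(['Page', 'LineNum'])['Word'].transform(' '.join)

print(df)
```

Every row in the same (Page, LineNum) group ends up with the same string, so the frame keeps its original length.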

score 0 · Accepted Answer · answered Mar 30 '18 at 17:08

I assume you want all the words across pages for each line number. In other words, you want a mapping from line number to set of words.

You can achieve this by simply grouping by LineNum and aggregating to set. Here is a minimal example:

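import pandas as pd

df = pd.DataFrame({'Page':[0,0,0,1,1,1,2],
                   'Word':['a','b','a','d','e','d','g'],
                   'LineNum':[0,0,1,0,1,2,0]})

# One set of distinct words per line number, pooled across all pages
res = df.groupby('LineNum')['Word'].apply(set)

# Sets are unordered, so the order of words inside the braces may differ
# LineNum
# 0    {b, g, a, d}
# 1          {a, e}
# 2             {d}
# Name: Word, dtype: object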
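
For the layout shown under "Desired output" in the question (one joined string per line on each page), grouping on both Page and LineNum and joining the words looks like a closer fit than a set, since a set drops duplicates and word order. A minimal sketch, assuming the rows are already in reading order; the frame below is retyped from the question, and line_df is just a name picked for illustration:

```python
import pandas as pd

# Data retyped from the question's table
df = pd.DataFrame({
    'Page':    [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    'Word':    ['Hello', 'This', 'is', 'an', 'example', 'of', 'words',
                'across', 'multiple', 'pages', 'in', 'the',
                'document', 'which', 'has', 'split'],
    'LineNum': [1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1],
})

# sort=False keeps the groups in order of first appearance;
# ' '.join concatenates the words of each (page, line) group.
line_df = (
    df.groupby(['Page', 'LineNum'], sort=False)['Word']
      .agg(' '.join)
      .reset_index(name='FullLine')
)

print(line_df['FullLine'].tolist())
# ['Hello This', 'is an', 'example of words', 'across multiple',
#  'pages in the', 'document which has split']
```

line_df also keeps the Page and LineNum columns, which may be handy for tracing a line back to its position in the document.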
