Creation of extra column that increases the value after each blank row using pandas

Question

I have a CSV file like the one below:

word   tag
w1     t1
w2     t2
w3     t3

w4     t4
w5     t5
w6     t6
w7     t7

w8     t8
w9     t9

I want to add a column named sentence# whose value increases by one for each new sentence, as shown below.

Desired output:

sentence#    word   tag
sentence:1   w1     t1
             w2     t2
             w3     t3
  
sentence:2   w4     t4
             w5     t5
             w6     t6
             w7     t7
    
sentence:3   w8     t8
             w9     t9

Each time a blank row is reached, the sentence number should increase by one. How can I get the desired output above?

Code:

from csv import reader
import pandas as pd

i = 0
with open('username.csv', 'rt', encoding='utf-8') as f:
    csv_reader = pd.read_csv(f, delimiter=';')
    csv_reader1 = reader(f)

for line in csv_reader1:
    if not line:
        i+=1 # empty lines
    else:
        csv_reader["sentence#"] = i
        print(line)

Welcome to SO! Please take a moment to read about how to post pandas questions: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — YOLO, Aug 07 '20 at 21:20
Does this answer your question? [Pandas conditional creation of a series/dataframe column](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column) — Trenton McKinney, Aug 07 '20 at 23:45

Mike67 · Answer 1 · 2020-08-08T14:18:37.170

Since you're just adding text to the start of the line, there's no need to process the file as a CSV. Just read the file and insert starting text as needed:

newblock = True
sout = ""
i = 0
with open('words.txt', 'rt', encoding='utf-8') as f:
    for line in f:
        if i == 0:
            sout = "sentence#;" + line  # header
            i = 1
        elif line.strip() == "":  # blank line
            sout += '\n'
            newblock = True  # next line is new sentence
        elif newblock:  # new sentence
            sout += "sentence:" + str(i) + ";" + line  # include counter
            i += 1
            newblock = False  # wait for next blank line
        else:
            sout += ";" + line  # copy existing line
print(sout)

score 2 · Accepted Answer · answered Aug 07 '20 at 22:10

A pandas solution

Since you are using the empty rows to separate sentences, you need to be aware that pd.read_csv has a parameter skip_blank_lines that defaults to True. Just set it to False so we can use those lines.
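To see the effect of that parameter, here is a minimal sketch on a made-up two-word sample (the `raw` string and the variable names are only for illustration). It reads the same text twice and compares how many rows come back:

```python
import io
import pandas as pd

# Hypothetical sample: two words separated by one blank line.
raw = "word;tag\nw1;t1\n\nw2;t2"

# Default behaviour: the blank line is silently skipped.
default_df = pd.read_csv(io.StringIO(raw), sep=";")

# skip_blank_lines=False: the blank line is kept as a row of NaN values.
kept_df = pd.read_csv(io.StringIO(raw), sep=";", skip_blank_lines=False)

print(len(default_df), len(kept_df))
print(kept_df["word"].isna().tolist())  # True marks the separator row
```

The NaN row is what makes the sentence boundary visible to pandas; without it there is nothing left to count.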

Secondly, it is generally better to operate on a whole column or row at once than to loop (it is usually faster and in some cases uses less memory). For this to work you need a pattern that can be detected across the whole column: here, the blank lines.

The sample data

import io
import pandas as pd
fo = io.StringIO('''word;tag
w1;t1
w2;t2
w3;t3

w4;t4
w5;t5
w6;t6
w7;t7

w8;t8
w9;t9''')

df = pd.read_csv(fo, sep=';', skip_blank_lines=False)
fo.close()

The code

# breakdown:
#   .isna() marks True on all empty rows
#   .cumsum() creates the increasing integer id for each sentence
df.insert(0, column='sentence', value=df.word.isna().cumsum()+1)
df.dropna(subset=['word'], inplace=True)

# if you really need to include the prefix 'sentence:' on each row
df.sentence = 'sentence:' + df.sentence.astype(str)

Voilà

      sentence word tag
0   sentence:1   w1  t1
1   sentence:1   w2  t2
2   sentence:1   w3  t3
4   sentence:2   w4  t4
5   sentence:2   w5  t5
6   sentence:2   w6  t6
7   sentence:2   w7  t7
9   sentence:3   w8  t8
10  sentence:3   w9  t9
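The result above repeats the label on every row, while the desired output in the question shows it only on the first row of each sentence. One possible way to get that, sketched on a shortened made-up sample (`raw`, `is_blank`, `is_first` and `label` are illustrative names, and it assumes sentences are separated by exactly one blank line):

```python
import io
import pandas as pd

# Hypothetical shortened sample: two sentences separated by one blank line.
raw = "word;tag\nw1;t1\nw2;t2\n\nw3;t3"
df = pd.read_csv(io.StringIO(raw), sep=";", skip_blank_lines=False)

is_blank = df["word"].isna()                  # True on separator rows
label = "sentence:" + (is_blank.cumsum() + 1).astype(str)

# A row starts a sentence if it is the first row or follows a blank row.
is_first = is_blank.shift(1, fill_value=True)

# Keep the label only on sentence-starting rows, empty string elsewhere.
df.insert(0, "sentence#", label.where(is_first, ""))

# Drop the separator rows now that they have served their purpose.
df = df[~is_blank].reset_index(drop=True)

print(df.to_string(index=False))
```

Whether to blank out the repeated labels depends on what happens next: the repeated form is easier to group or filter on, while this form matches the layout asked for in the question.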
