I want to get the first half of each string in a pandas dataframe column, where the length varies row by row. I have searched around and found similar questions, but the solutions all focus on delimiters and regular expressions. I don't have a delimiter - I just want the first half of the string, however long it is.

I can get as far as specifying the string length I want:

import pandas as pd

eggs = pd.DataFrame({"id": [0, 1, 2, 3],
                     "text": ["eggs and spam", "green eggs and spam", "eggs and spam2", "green eggs"]})

eggs["half_length"] = eggs.text.str.len() // 2

and then I want to do something like eggs["truncated_text"] = eggs["text"].str[:eggs.half_length]. Or is defining this column the wrong way to go in the first place? Can anyone help?

Tom Wagstaff
  • What is your definition of "first half"? Is "and" included in the count? If you have three words, how would you define half? – Ade_1 May 23 '21 at 21:33

2 Answers

You can apply a function to the text column:

import pandas as pd

eggs = pd.DataFrame({"id": [0, 1, 2, 3],
                     "text": ["eggs and spam", "green eggs and spam", "eggs and spam2", "green eggs"]})

eggs['truncated_text'] = eggs['text'].apply(lambda text: text[:len(text) // 2])

Output

|   id | text                | truncated_text   |
|-----:|:--------------------|:-----------------|
|    0 | eggs and spam       | eggs a           |
|    1 | green eggs and spam | green egg        |
|    2 | eggs and spam2      | eggs an          |
|    3 | green eggs          | green            |
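
If you would rather reuse the half_length column from the question, a plain list comprehension over both columns is one option. This is only a sketch of the same slicing idea, and it assumes every value in text is a string (no NaN):

```python
import pandas as pd

eggs = pd.DataFrame({"id": [0, 1, 2, 3],
                     "text": ["eggs and spam", "green eggs and spam", "eggs and spam2", "green eggs"]})

# Per-row length of the first half, as computed in the question
eggs["half_length"] = eggs["text"].str.len() // 2

# Pair each string with its own cut-off and slice it (assumes no missing values)
eggs["truncated_text"] = [text[:n] for text, n in zip(eggs["text"], eggs["half_length"])]
```

This should give the same truncated_text values as the table above.
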
Kafels
You can do this using vectorized operations, which are generally faster than the .apply method. This article explains vectorized operations in more depth: https://realpython.com/fast-flexible-pandas/

An example of using vectorized operations for strings can be found in the following post: "Pandas make new column from string slice of another column"
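
For reference, the string-slicing pattern looks roughly like the sketch below. Note the assumptions: the cut-off of 5 characters and the first_five column name are my own choices for illustration. As far as I know, the .str slice takes a single bound for the whole column rather than one per row, so on its own it does not cover the varying-length case in the question:

```python
import pandas as pd

eggs = pd.DataFrame({"id": [0, 1, 2, 3],
                     "text": ["eggs and spam", "green eggs and spam", "eggs and spam2", "green eggs"]})

# Same fixed cut-off for every row (5 is an arbitrary illustrative value)
eggs["first_five"] = eggs["text"].str[:5]
```
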

dmm98