I am trying to edit my pandas dataframe based on some specifications. I need a certain layout of my cells in order for my program to work. Currently, my data looks something like this:

    x  y
A   1  information
B   2  information and some stuff
C   3  information and random stuff

But I need it to look like this:

    x  y
A   1  information
B   2  information
C   3  information

So basically, it needs to scan through every cell and check for a keyword ("and" in my example). If the keyword is found, it needs to delete the keyword and everything after it, leaving only the important information behind.

I can't work out an efficient way to do this. Any help is appreciated.

maddes
  • Does your data always contain the word 'and'? – Roxy Aug 30 '21 at 20:02
  • *"it needs to scan through every cell..."* No it doesn't, it only needs to search the string column(s), 'y'. So your code will simply be `df['y'] = df['y'].str.replace(pattern, replacement)`. The rest is you figuring out which regex to use. See doc for [`str.replace`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html) – smci Aug 30 '21 at 20:06
  • ...and if you want to select *all* string columns in your dataframe, use `df.select_dtypes('string')`. See [this](https://stackoverflow.com/questions/64374660/apply-transformation-only-on-string-columns-with-pandas-ignoring-numeric-data) – smci Aug 30 '21 at 20:09
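
The idea in the comments above (run the replacement on every text column, leave numeric columns alone) can be sketched as follows. This is only an illustration under some assumptions: the extra column "z" is made up, and the pattern `\s+and\b.*` is one possible regex, not one given in the thread. Note that `select_dtypes('string')` matches only pandas' dedicated string dtype, while columns built from plain Python strings may be stored as `object` depending on the pandas version, so this sketch uses `pd.api.types.is_string_dtype` to pick the columns instead.

```python
import pandas as pd

# Sample frame; column "z" is hypothetical and only here to show a second text column.
df = pd.DataFrame(
    {
        "x": [1, 2, 3],
        "y": ["information", "information and some stuff", "information and random stuff"],
        "z": ["a and b", "c", "d and e and f"],
    },
    index=["A", "B", "C"],
)

# Pick the columns that hold text; numeric columns such as "x" are skipped.
text_cols = [col for col in df.columns if pd.api.types.is_string_dtype(df[col])]

# Remove the whitespace before "and", the word itself, and everything after it.
for col in text_cols:
    df[col] = df[col].str.replace(r"\s+and\b.*", "", regex=True)

print(df)
```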

3 Answers

You can access the y column and use the .str API to replace the word 'and' and everything after it with an empty string.

df.y = df.y.str.replace(r' and .*', '', regex=True)
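
If the keyword is not fixed, the same regex approach can be wrapped in a small function. The following is a sketch, not part of the original answer: `strip_after_keyword` is a hypothetical helper name, and it assumes the keyword is an ordinary word, so the `\b` word boundaries keep strings like "sand castles" intact. `re.escape` guards against keywords that contain regex metacharacters.

```python
import re

import pandas as pd


def strip_after_keyword(series, keyword):
    """Drop the keyword and everything after it (hypothetical helper)."""
    # \s* also removes the whitespace in front of the keyword;
    # \b ... \b makes sure only the whole word matches.
    pattern = r"\s*\b" + re.escape(keyword) + r"\b.*"
    return series.str.replace(pattern, "", regex=True)


s = pd.Series(["information and some stuff", "sand castles", "information"])
cleaned = strip_after_keyword(s, "and")
print(cleaned)
```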
James

You can split each string on the keyword with .str.split(), then take the substring to the left of it with .str[0]:

df['y'] = df['y'].str.split(' and').str[0]

Result:

print(df)

   x            y
A  1  information
B  2  information
C  3  information
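
A closely related option, shown here only as a sketch and not taken from the answer, is `.str.partition()` in place of `.str.split()`. It splits at the first occurrence of the separator only and returns three columns, so the text before the keyword is always column 0. The sample values are made up for illustration.

```python
import pandas as pd

y = pd.Series(
    ["information", "information and some stuff", "first and second and third"],
    index=["A", "B", "C"],
)

# partition splits at the first occurrence only and returns three columns:
# 0 = text before the separator, 1 = the separator, 2 = text after it.
parts = y.str.partition(" and ")
before = parts[0]
print(before)
```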
SeaBean

You can use Python's str.split(" keyword ") to break each string into a list and keep only the first element.

import pandas as pd

# create the df to work with:
df = pd.DataFrame(
    {
        "x": [
            1,
            2,
            3,
        ],
        "y": [
            "information",
            "information and some stuff",
            "information and random stuff"
        ]
    }
)


for index in df.index:  # loop over each row label
    current_line = df.loc[index, "y"]  # current cell as a string
    current_line_list = current_line.split(" and ")  # e.g. ['information', 'some stuff']
    df.loc[index, "y"] = current_line_list[0]  # keep only the text before the keyword

Result:

print(df)

   x            y
0  1  information
1  2  information
2  3  information
lorenz-ry