7

I Have a list of strings like this:

stringlist = [JAN, jan, FEB, feb, mar]

And I have a dataframe that looks like this:

**date**            **value**
01MAR16                1
05FEB16                12
10jan17                5
10mar15                9
03jan05                7
04APR12                3

I only want to keep the dates which contain one string from stringlist in it, the result should look like this:

**date**            **value**
NA                     1
05FEB16                12
10jan17                5
10mar15                9
03jan05                7
NA                     3

Im new to using regular expression so having some trouble wrapping my head around it, would appreciate some help.

ljourney
  • 515
  • 4
  • 11

4 Answers4

5
stringlist = ["JAN", "jan", "FEB", "feb", "mar"]

m = df["date"].str.contains("|".join(stringlist))
df.loc[~m, "date"] = np.nan
print(df)

Prints:

      date  value
0      NaN      1
1  05FEB16     12
2  10jan17      5
3  10mar15      9
4  03jan05      7
5      NaN      3
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
1

You can use the Series.str.contains method as demonstrated here: Select by partial string from a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'date': ['NA', '05FEB16', '10jan17', '10mar15', '03jan05', 'NA'],
                   'value': [1, 12, 5, 9, 7, 3]})

stringlist = ['JAN', 'jan', 'FEB', 'feb', 'mar']

print(df[df['date'].str.contains('|'.join(stringlist))])

Output:

      date  value
1  05FEB16     12
2  10jan17      5
3  10mar15      9
4  03jan05      7
Red
  • 26,798
  • 7
  • 36
  • 58
1

Another play on regular expressions is to extract the characters (assumption here is that the months will alway be sandwiched between day and year), then check if each extract can be found in stringlist:

(df.assign(months = df.date.str.extract(r'([a-zA-Z]+)'), 
           date = lambda df: df.where(df.months.isin(stringlist))
          )
   .iloc[:, :-1]
)

      date  value
0      NaN      1
1  05FEB16     12
2  10jan17      5
3  10mar15      9
4  03jan05      7
5      NaN      3
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
0

You can simply do this:


import pandas as pd

keys = ["JAN", "jan", "FEB", "feb", "mar"]
date = ["01MAR16", "05FEB16", "10jan17", "10mar15", "03jan05", "04APR12"]
value = [1, 12, 5, 9, 7, 3]

df = pd.DataFrame({"date": date,
                   "value": value})

for i in range(len(df)):
    for key in keys:
        if key in df["date"][i]:
            print(df["date"][i], df["value"][i])
            break

Output

05FEB16 12
10jan17 5
10mar15 9
03jan05 7
Funpy97
  • 282
  • 2
  • 9