I have a DataFrame (df) with a Title column that holds many different journal titles.

I want to extract only the journals whose title is exactly one of the following:

Blood, Cancer, Chest, Circulation, Diabetes, JAMA, Endocrinology, Gastroenterology, Gut, Medicine, Neurology, Pediatrics, Physical therapy, Radiology, Surgery, Geriatrics

Some journals contain the same words as part of a longer title (Blood circulation, Cancer History, etc.). I do not want to select those.

Example

Id Title
1  Blood
2  Blood
3  Blood purification
4  Blood transfusion
5  Cancer
6  Chest
7  Cancer History
8  Chest Analysis

I want to keep only the exact journal titles and put them in a new column, "Influential", but I cannot find a way to do this with str.contains or str.match.

I have tried two approaches, and both also match the longer titles:

df.loc[df['Title'].str.contains("Blood", case = True, na = False), 'Influential'] = 'Blood'
df.loc[df['Title'].str.match("Blood", case = True, na = False), 'Influential'] = 'Blood'

Expected output with the exact title of the journal:

Id Title              Influential
1  Blood              Blood
2  Blood              Blood
3  Blood purification NA
4  Blood transfusion  NA
5  Cancer             Cancer
6  Chest              Chest
7  Cancer History     NA
8  Chest Analysis     NA

Should I do it somehow via regex? Thanks.

Anakin Skywalker

1 Answer

If you want to set the Influential column to the value from the Title column only when the title is an exact match for one of the entries in a list (lst in the code that follows), you can use Series.isin:

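import numpy as np
import pandas as pd

df = pd.DataFrame({'Id':[1,2,3,4,5,6,7,8], 'Title': ['Blood','Blood', 'Blood purification', 'Blood transfusion', 'Cancer', 'Chest', 'Cancer History', 'Chest Analysis']})
lst = ['Blood', 'Chest', 'Cancer']  # exact titles to keep
# Copy Title where it is exactly one of the entries in lst, otherwise use NaN
df['Influential'] = np.where(df['Title'].isin(lst), df['Title'], np.nan)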
# >>> df
#    Id               Title Influential
# 0   1               Blood       Blood
# 1   2               Blood       Blood
# 2   3  Blood purification         NaN
# 3   4   Blood transfusion         NaN
# 4   5              Cancer      Cancer
# 5   6               Chest       Chest
# 6   7      Cancer History         NaN
# 7   8      Chest Analysis         NaN

Note the use of numpy.where (also suggested in the comments).
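
As for the regex part of the question: it can be done with a regex, but it should not be needed. The sketch below is only an illustration under a few assumptions: the variable names (journals, exact, exact_regex) are made up for the example, the full title list is copied from the question, and Series.where is swapped in for numpy.where. It builds the mask once with Series.isin and once with Series.str.fullmatch, which only accepts a title when the pattern consumes the whole string, so "Blood purification" is rejected in both cases.

```python
import re

import pandas as pd

# Full list of exact titles, copied from the question.
journals = [
    "Blood", "Cancer", "Chest", "Circulation", "Diabetes", "JAMA",
    "Endocrinology", "Gastroenterology", "Gut", "Medicine", "Neurology",
    "Pediatrics", "Physical therapy", "Radiology", "Surgery", "Geriatrics",
]

df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8],
    "Title": ["Blood", "Blood", "Blood purification", "Blood transfusion",
              "Cancer", "Chest", "Cancer History", "Chest Analysis"],
})

# Option 1: exact, case-sensitive membership test (no regex involved).
exact = df["Title"].isin(journals)

# Option 2: regex equivalent. Each title is escaped and the alternation is
# wrapped in a non-capturing group so the whole string must be one title.
pattern = "(?:" + "|".join(re.escape(j) for j in journals) + ")"
exact_regex = df["Title"].str.fullmatch(pattern, na=False)

# Series.where keeps the title where the mask is True and puts NaN elsewhere.
df["Influential"] = df["Title"].where(exact)
```

Both masks select the same rows here. isin is the simpler choice for a fixed list of titles; the fullmatch variant is mainly useful if the titles later need real pattern features such as case-insensitive matching.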

Wiktor Stribiżew
  • this works but it's better to avoid `apply` when performance is an issue - my answer (in the comment on OP's post) is more efficient – Josh Friedlander Dec 02 '21 at 12:24
  • Wiktor, I updated the question slightly. What if I have multiple different titles? The current solution overwrites everything besides a single title with NaN – Anakin Skywalker Dec 02 '21 at 12:41
  • @AnakinSkywalker So, if there is `Blood`, but with other text, then there must be `na`, else, the original is kept? – Wiktor Stribiżew Dec 02 '21 at 12:47
  • @Wiktor, I have a list of journals and want to create a column only for journals with the exact same name. Please see the updated question. Thanks for your time! – Anakin Skywalker Dec 02 '21 at 12:51
  • @AnakinSkywalker `df['Influential'] = np.where(df['Title'].str.count(' ') == 0, df['Title'], np.nan)`? Or `np.where(df['Title'].str.count(r'\s') == 0, df['Title'], np.nan)` to account for any whitespace? I mean, do you only need to extract the title if it only contains one word? – Wiktor Stribiżew Dec 02 '21 at 12:52
  • I do not think that is the way. I have lst = ['Blood', 'Chest', 'Cancer'] etc. and I want to extract journals with these exact names only, not titles where the words above are a part of the title. – Anakin Skywalker Dec 02 '21 at 12:55
  • @AnakinSkywalker What do you mean by "exact match" here then? Only two specific words? – Wiktor Stribiżew Dec 02 '21 at 12:56
  • Wiktor, updated the initial question – Anakin Skywalker Dec 02 '21 at 13:00