0

I am trying to select rows where the "story" column contains any of the strings in my list "selected_words".

I've tried several options including isin and str.contains but I generally only get errors or else an empty dataframe.

df4=pd.read_csv("https://drive.google.com/file/d/1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW/view?usp=sharing")
df4["story"] = df4["story"].astype(str) 
selected_words = ['accept', 'believe', 'trust', 'accepted', 'accepts',\
'trusts', 'believes', 'acceptance', 'trusted', 'trusting', 'accepting',\ 'believes', 'believing', 'believed', 'normal', 'normalize', ' normalized',\ 'routine', 'belief', 'faith', 'confidence', 'adoption', \
'adopt', 'adopted', 'embrace', 'approve', 'approval', 'approved', 'approves']
#At this point I am lost as to what to do next

I get either an empty dataframe, or an error message depending on what I try to do.

  • can you show us what you've tried and share the results (either empty df or the traceback?) – wpercy Nov 04 '19 at 21:27
  • Part of the issue may be this part of the code: `df4=pd.read_csv("https://drive.google.com/file/d/1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW/view?usp=sharing")` While that URL displays CSV data in a browser, it doesn't return CSV data when called with an HTTP client. – dmmfll Nov 04 '19 at 21:36

3 Answers3

1

Try this. I'm unable to load your DF.

df4[df4["story"].isin(selected_words)]
Adam Zeldin
  • 898
  • 4
  • 6
0

Here you can see a solution https://stackoverflow.com/a/26577689/12322720

Basically str.contains supports Regular Expressions so you can concatenate with an or pipe

df4[df4.story.str.contains('|'.join(selected_words))]
ZZ4
  • 56
  • 1
  • 5
  • Thanks for the suggestion, this gave me a list of booleen returns, so I slightly modified it to this: words = df4[df4.story.str.contains('|'.join(selected_words))] and got the necessary result. Thanks so much!!!!! – Jesse-Burton Nicholson Nov 04 '19 at 21:37
0

I'm currently learning more Pandas myself so I wanted to contribute an answer I just learned from a book.

It's possible to create a "mask" using a Pandas Series and use that to filter the Dataframe.

import pandas as pd

# This URL doesn't return CSV.
CSV_URL = "https://drive.google.com/open?id=1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW"
# Data file saved from within a browser to help with question.

# I stored the BitcoinData.csv data on my Minio server.
df = pd.read_csv("https://minio.apps.selfip.com/mymedia/csv/BitcoinData.csv")


selected_words = [
    "accept",
    "believe",
    "trust",
    "accepted",
    "accepts",
    "trusts",
    "believes",
    "acceptance",
    "trusted",
    "trusting",
    "accepting",
    "believes",
    "believing",
    "believed",
    "normal",
    "normalize",
    " normalized",
    "routine",
    "belief",
    "faith",
    "confidence",
    "adoption",
    "adopt",
    "adopted",
    "embrace",
    "approve",
    "approval",
    "approved",
    "approves",
]

# %%timeit run in Jupyter notebook

mask = pd.Series(any(word in item for word in selected_words) for item in df["story"])

# results 18.2 ms ± 94.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# %%timeit run in Jupyter notebook

df[mask]

# results: 955 µs ± 6.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# %%timeit run in Jupyter notebook

df[df.story.str.contains('|'.join(selected_words))]

# results 129 ms ± 738 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# True for all
df[mask] == df[df.story.str.contains('|'.join(selected_words))]

# It is possible to calculate the mask inside of the index operation though of course a time penalty is taken rather than using the calculated and stored mask.

# %%timeit run in Jupyter notebook

df[[any(word in item for word in selected_words) for item in df["story"]]]

# results 18.2 ms ± 94.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# This is still faster than using the alternative `df.story.str.contains`

#

The mask way of searching is significantly faster.

dmmfll
  • 2,666
  • 2
  • 35
  • 41