How can I extract sentences with certain text in a spreadsheet?

Question

I have a spreadsheet that looks like this. I would like to keep the file column but extract only the sentences containing the word "India". Is there a way to do that? I would prefer KNIME or R, but I am happy with any solution.

Only the sentences containing "India" are extracted, and the file column is kept.

A general rule of thumb is that you should always produce a minimum reproducible example of your problem. You can find information about MREs here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — flxflks, May 11 '23 at 11:21

L Tyrone · Answer 1 · 2023-05-11T05:56:04.997

This can be achieved using filter() from the dplyr package and str_detect() from the stringr package. Note that the pattern "India|india" in the following code matches both "India" and the uncapitalized "india", in case it exists. There are no spaces around the |, because spaces inside a regular expression are matched literally:

library(dplyr)
library(stringr)

# Some example data
df <- data.frame(File = c(1356, 1548, 1600, 1601),
                 Text = c("Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by i",
                          "The textile industry in India traditionally, after agriculture, is the only industry that has generated huge employment for both skilled and unskilled labour. The textile industry conti",
                          "Some other text",
                          "This string has india without a capital I."))

# Keep only the rows whose Text contains "India" or "india"
df <- df %>%
  filter(str_detect(Text, "India|india"))

df
#   File   Text
# 1 1356   Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by i
# 2 1548   The textile industry in India traditionally, after agriculture, is the only industry that has generated huge employment for both skilled and unskilled labour. The textile industry conti
# 3 1601   This string has india without a capital I.
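
The filter above keeps or drops whole rows. Since the question asks for individual sentences, here is a minimal base R sketch of one way to do that. It assumes a sentence ends with ".", "!" or "?" followed by whitespace, which will split wrongly on abbreviations such as "e.g.". The example data and the column name India_sentences are made up for illustration.

```r
# Hypothetical example data with the same two columns as above
df <- data.frame(
  File = c(1356, 1548),
  Text = c("Digital India is an initiative. It has several parts. Services in India went online.",
           "Some other text. Nothing relevant here."),
  stringsAsFactors = FALSE
)

# Split each text into sentences at whitespace that follows ., ! or ?
sentences <- strsplit(df$Text, "(?<=[.!?])\\s+", perl = TRUE)

# Within each row, keep only the sentences that mention India
# and paste them back into a single string
df$India_sentences <- vapply(
  sentences,
  function(s) paste(s[grepl("India", s, ignore.case = TRUE)], collapse = " "),
  character(1)
)

# Drop rows where no sentence matched; keep the File column
result <- df[nzchar(df$India_sentences), c("File", "India_sentences")]
result
```

Here result should contain only file 1356, with the middle sentence removed from its text.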

score 0 · Answer 2 · answered May 11 '23 at 06:31

We can use base R with grepl():

subset(df, grepl("India", Text, ignore.case = TRUE))
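
One caveat: a case-insensitive substring match also hits longer words that contain "india", such as "Indiana". If that matters, a word-boundary pattern is one option. The sketch below uses made-up strings, and the variable names are hypothetical.

```r
# Hypothetical strings to show the difference
x <- c("Made in India.", "An Indiana town", "india without a capital")

# Plain case-insensitive substring match: also matches "Indiana"
substring_match <- grepl("India", x, ignore.case = TRUE)

# Whole-word match using \b word boundaries: "Indiana" is excluded
whole_word <- grepl("\\bindia\\b", x, ignore.case = TRUE, perl = TRUE)

substring_match
whole_word
```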

akrun
