I have a data frame (`df`) from which I wish to delete every row where the first word in a column (`df$a`) is lowercase. I suppose the solution involves regex, but I have very little experience with regular expressions. I've also looked at the `lettercase` and `textclean` packages but was unable to find a concrete illustration for my needs. Thank you!

Momchill
2 Answers
We can use `grepl` to flag the rows whose first word is lowercase, then negate the result to drop them:
df[!grepl("^[a-z]+\\b", df$a), , drop = FALSE]
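To illustrate how the one-liner above behaves, here is a minimal sketch on a toy data frame. The values and the names `starts_lower` and `kept` are made up for illustration and are not the asker's data.

```r
# Toy data frame (hypothetical values, not the asker's data)
df <- data.frame(
  a = c("Hello world", "hello World", "Another row"),
  n = 1:3,
  stringsAsFactors = FALSE
)

# TRUE where the value starts with a run of lowercase ASCII letters
# that ends at a word boundary
starts_lower <- grepl("^[a-z]+\\b", df$a)

# Keep the rows that do NOT match; drop = FALSE keeps the result a data frame
# even if only one column is left
kept <- df[!starts_lower, , drop = FALSE]
print(kept)
```

Two things to keep in mind with this pattern: `[a-z]` only covers unaccented ASCII letters, and because digits count as word characters, a value such as "value2" has no word boundary right after its letters and so would not be flagged.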

akrun
- Shouldn't it be `df$a`? – prosoitos Oct 21 '18 at 19:25
- Hi, sorry for the belated reply. I am also sorry to say that your solutions both do not seem to produce the output I am looking for. The regex as it stands removes all kinds of rows, including the ones that start with a word with an uppercase letter. – Momchill Nov 19 '18 at 12:37
- Also the addition of @prosoitos introduces an error "incorrect number of dimensions" – Momchill Nov 19 '18 at 13:30
- @Momchill It would be better if you can provide a small reproducible example – akrun Nov 19 '18 at 18:49
- @akrun, I just discovered that the dataframe has a whole bunch of records that are in different languages, which R doesn't read too well. Let me resolve this and I'll come back with an example. Thank you for the interest! – Momchill Nov 19 '18 at 19:02
- I posted something that should work regardless of the languages you have – prosoitos Nov 19 '18 at 19:32
library(tidyverse)
Toy example with a mix of upper and lower case values:
df <- tibble(
a = c("Value1", "value2", "Value3"),
b = c("value4", "Value5", "value6"),
c = c("value7", "value8", "value9"),
d = 1:3
)
df
# A tibble: 3 x 4
  a      b      c          d
  <chr>  <chr>  <chr>  <int>
1 Value1 value4 value7     1
2 value2 Value5 value8     2
3 Value3 value6 value9     3
Code
Base R:
# In base R, POSIX classes must sit inside a bracket expression: [[:lower:]]
df[!grepl("^[[:lower:]].*$", df$a), ]
Tidyverse:
df[!str_detect(df$a, "^[[:lower:]].*$"), ]
Result
# A tibble: 2 x 4
  a      b      c          d
  <chr>  <chr>  <chr>  <int>
1 Value1 value4 value7     1
2 Value3 value6 value9     3
Note that this also works if you have several words per value (since you only care about the first character of the first word, it doesn't matter whether there are word boundaries):
df <- tibble(
a = c("Word1 and other words", "word2 AND others", "Word3 And Other Words"),
b = c("word4", "Word5", "word6"),
c = c("word7", "word8", "word9"),
d = 1:3
)
df[!grepl("^[[:lower:]].*$", df$a), ]
# A tibble: 2 x 4
  a                     b     c         d
  <chr>                 <chr> <chr> <int>
1 Word1 and other words word4 word7     1
2 Word3 And Other Words word6 word9     3
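One detail that can make base R and stringr disagree: in base R's regular expressions, a POSIX class such as `[:lower:]` is only recognized inside a bracket expression, so it is written `[[:lower:]]`. With single brackets, base R reads the pattern as a set of literal characters instead. The sketch below compares the two spellings on a made-up vector; `x`, `lower_start`, and `literal_set` are illustrative names.

```r
# Made-up values for illustration
x <- c("Value1", "value2", "Word3 And Other Words", "word4 AND others")

# POSIX class inside a bracket expression: a leading lowercase letter
lower_start <- grepl("^[[:lower:]]", x)

# Single brackets: base R treats this as a set of the literal characters
# ':', 'l', 'o', 'w', 'e', 'r', which is not the same test
literal_set <- grepl("^[:lower:]", x)

# Compare the two results side by side
print(data.frame(x, lower_start, literal_set))
```

The double-bracket form is also accepted by `stringr::str_detect`, so using it in both places should keep the two approaches consistent.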

prosoitos
- I don't know why, but your suggestion works perfectly with your example and behaves just as expected, but when I use it on my dataframe/tibble/datatable (tried all) it produces wrong output. Not only that, but it doesn't seem to matter whether or not I remove the NOT operator (`!`) - for some records they get selected anyway, which is really strange. I am a bit lost as to why this might be happening. Otherwise your suggestion is brilliant and simple. – Momchill Nov 19 '18 at 21:27
- Also the two methods you've articulated - base R and tidyverse - produce different output. Is it possible something is terribly wrong with my source data? – Momchill Nov 19 '18 at 22:17
- You need to create a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) when you ask questions. Your data may have some characteristics that your description (and thus my toy example) did not cover. It does not mean that there is anything wrong with it. Simply that it is more complex than the way you described it. Without more information about your data, I cannot help you more. – prosoitos Nov 20 '18 at 01:11
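As a rough sketch of what such a reproducible example could look like when the data contains text in several languages: the data frame below is a hypothetical stand-in (the accented values are invented), `dput` prints a copy-pasteable version of it, and `\\p{Ll}` with `perl = TRUE` is one Unicode-aware way to test for a leading lowercase letter. Whether this matches the asker's real data is an assumption.

```r
# Hypothetical stand-in for the real data; replace with a few rows of your own
df <- data.frame(
  a = c("Value1", "value2", "\u00e9cole", "\u00c9cole"),
  d = 1:4,
  stringsAsFactors = FALSE
)

# Paste this output into the question so others can recreate the data exactly
dput(head(df, 4))

# Encoding R has recorded for each string ("unknown" is used for plain ASCII)
print(Encoding(df$a))

# Unicode-aware test for a leading lowercase letter (PCRE property class Ll)
starts_lower <- grepl("^\\p{Ll}", df$a, perl = TRUE)
print(df[!starts_lower, , drop = FALSE])
```

Sharing the `dput` output together with the `Encoding` result gives answerers enough to see whether the unexpected matches come from the pattern or from how the strings were read in.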