0

I have scraped some data and stored it in a data frame. Some rows contain unwanted information within square brackets, for example "[N] Team Name". I want to keep just the team name, so I first use the code below to remove the brackets and any text inside them:

gsub( " *\\(.*?\\) *", "", x)

This leaves me with " Team Name" (notice the space before the T). I am now trying to remove the whitespace before the T using trimws or the method shown here, but it is not working.

Could someone please help me remove the extra whitespace?

Note: if I type the string containing the space manually and apply trimws to it, it works. However, when I take the string directly from the data frame, it doesn't. Also, when I run the code snippet below (where df[1,1] is the same string retrieved from the data frame), I get FALSE. This leads me to believe that the string in the data frame is not the same as the manually typed string.

" team name" == df[1,1]
ganninu93
  • `trimws(" Team Name")` works for me – mtoto Jun 06 '16 at 08:05
  • Can you add e.g. `dput(utf8ToInt(x))` to your post - maybe it's a non-whitespace/tab/line-break character... Otherwise I'd say go with sth like `gsub( "\\[[^]]*\\]\\W*", "", "[N] Team Name")`... – lukeA Jun 06 '16 at 08:05
  • Thanks @lukeA. The gsub solution worked for me. Could you please post it as an answer so that I can mark it as the solution? – ganninu93 Jun 06 '16 at 08:32
  • Sure, I've done so. – lukeA Jun 06 '16 at 08:54

3 Answers

2

You could try

gsub( "\\[[^]]*\\]\\W*", "", "[N] Team Name")
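A possible explanation for why `\\W*` succeeds where ` *` and `trimws` do not is sketched below. It assumes (this is not confirmed in the question) that the character after the bracket is a non-breaking space, U+00A0, which is common in scraped HTML. The variable names are illustrative only.

```r
# Assumption: the scraped string holds a non-breaking space (U+00A0)
# after "]" rather than the ordinary space (U+0020) typed on a keyboard.
x <- "[N]\u00a0Team Name"

# Inspect the code point of the 4th character; 160 means U+00A0.
utf8ToInt(substr(x, 4, 4))

# " *" matches ordinary spaces only, so the non-breaking space survives.
stripped <- gsub(" *\\[.*?\\] *", "", x, perl = TRUE)

# By default trimws() trims only space, tab, CR and LF: no change here.
unchanged <- trimws(stripped)

# "\\W*" consumes any run of non-word characters after the bracket.
fixed <- gsub("\\[[^]]*\\]\\W*", "", x)

# Alternative (needs R >= 3.6.0): widen what trimws() treats as whitespace.
trimmed <- trimws(stripped, whitespace = "[\\h\\v]")
```

Note that `\\W*` would also strip punctuation that directly follows the bracket, so the `trimws` variant may be the safer choice if team names can start with a non-word character.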
lukeA
1

We can use

sub(".*\\]\\s+", "", x)
#[1] "Team Name"

Or just

sub("\\S+\\s+", "", x)
#[1] "Team Name"

data

x <- '[N] Team Name'
akrun
  • how do we do it on list of dataframes ? When I read the files as character column, I see `\xa0`. I am doing `sdf %>% purrr::map(~purrr::map_df(., ~trimws(.)))`. Then I get an error `Error in sub(re, "", x, perl = TRUE) : input string 34 is invalid UTF-8` – user5249203 Feb 23 '21 at 18:33
  • @user5249203 if it is a list of data.frame, use `lapply` i.e. `lapply(lst1, function(dat) {dat$x <- sub("\\S+\\s+", "", dat$x); dat})` – akrun Feb 23 '21 at 18:35
  • Thank you for the response. I am doing sdf %>% purrr::map(~purrr::map_df(., ~trimws(.))). Then I get an error Error in sub(re, "", x, perl = TRUE) : input string 34 is invalid UTF-8 – user5249203 Feb 23 '21 at 18:35
  • @user5249203 sorry, didn't see your edit earlier. you can use `sdf %>% map(~ .x %>% mutate(across(everything(), trimws)))` – akrun Feb 23 '21 at 18:36
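The "invalid UTF-8" error in the comments above may come from the `\xa0` byte itself: on its own it is a Latin-1 non-breaking space, not valid UTF-8. The sketch below assumes the files were Latin-1 encoded (an assumption, not something stated in the thread) and converts each character column before trimming. `sdf`, `team` and `clean_df` are hypothetical names, and the `whitespace` argument needs R >= 3.6.0.

```r
# Hypothetical input: a list of data frames whose strings carry the
# raw Latin-1 byte 0xA0 (non-breaking space) in front of the name.
raw <- "\xa0Team Name"
sdf <- list(a = data.frame(team = raw, stringsAsFactors = FALSE))

# Re-encode every character column to UTF-8, then trim all horizontal
# and vertical whitespace, including the non-breaking space.
clean_df <- function(dat) {
  dat[] <- lapply(dat, function(col) {
    if (!is.character(col)) return(col)
    col <- iconv(col, from = "latin1", to = "UTF-8")  # assumed source encoding
    trimws(col, whitespace = "[\\h\\v]")
  })
  dat
}

cleaned <- lapply(sdf, clean_df)
cleaned$a$team
```

If the source encoding is known when reading the files, declaring it at read time would avoid the conversion step.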
0

You should be able to remove the bracketed piece as well as any following whitespace with a single regex substitution. Your regex is correct as-is, and should successfully accomplish this. (Note: I've ignored the unexplained discrepancy between your use of parentheses vs. square brackets in your question. I've assumed square brackets for my answer.)

Strangely, this seems to be a case where the default regex engine is failing, but adding perl=T gets it working:

x <- '[N] Team Name'
gsub(' *\\[.*?\\] *', '', x)
## [1] " Team Name"
gsub(' *\\[.*?\\] *', '', x, perl = TRUE)
## [1] "Team Name"

In the past I have run across cases where the default regex engine flakes out, but I have never encountered this with perl=T, so I suggest you use that. I really think there is something broken in the default regex implementation.

bgoldst
  • How about `gsub("^\\s*\\[.*\\]\\s*", "", x)` – talat Jun 06 '16 at 08:10
  • @docendodiscimus You seem to have omitted the non-greedy modifier `?`, which seems to make the substitution work for this test case, but I'm sure the OP requires non-greediness for the bracketed extent, so that is not a solution. – bgoldst Jun 06 '16 at 08:12
  • I don't see when or where this might be a problem since `*` will match zero or more times. Do you have an example? – talat Jun 06 '16 at 08:16
  • Here's an example: `[N] Team [M] Name`. A greedy bracketed extent would gobble up everything from the `[N]` through the `[M]`, including the `Team` piece, which I'm sure the OP would not want. – bgoldst Jun 06 '16 at 08:20
  • And @docendodiscimus, this is beside the point, since from a point of view of *correctness*, the OP's regex should accomplish what he wants. The default regex engine is failing to do what it is required to do according to the definition of regular expressions. Getting the default regex engine to work for you by fiddling with different variations on the same regex is not a good way to address this problem. – bgoldst Jun 06 '16 at 08:23
  • Thanks a lot for your help; however, your solutions managed to remove the bracketed part but not the white space. What worked for me was the solution provided by lukeA, which uses \\W – ganninu93 Jun 06 '16 at 08:35
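The greedy versus non-greedy point from the comments can be checked on the `[N] Team [M] Name` example. This is a small sketch using `perl = TRUE` for the quantifier variants, plus a negated character class as one way to get the same result without a lazy quantifier; the variable names are illustrative.

```r
x <- "[N] Team [M] Name"

# Greedy: ".*" runs from the first "[" to the last "]", swallowing "Team".
greedy <- gsub("\\[.*\\] *", "", x, perl = TRUE)

# Non-greedy: ".*?" stops at the nearest "]", so each tag goes separately.
lazy <- gsub("\\[.*?\\] *", "", x, perl = TRUE)

# Negated class: "[^]]*" cannot cross a "]", so no lazy quantifier is needed.
negated <- gsub("\\[[^]]*\\] *", "", x)

greedy   # "Name"
lazy     # "Team Name"
negated  # "Team Name"
```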