
I am doing a text analysis in R and need to convert the first letter of each sentence to lowercase while leaving other capitalized words unchanged. I used the command

     x <- gsub("(\\..*?[A-Z])", '\\L\\1', x, perl=TRUE)

which worked, but only partially. For the text analysis I had to convert the PDF files into txt format, and the txt files now contain a lot of empty lines (possibly page breaks or returns). As a result, the command does not convert capital letters that appear at the start of a new line. I tried to eliminate the empty lines using different combinations in gsub with multiple \s, \r and \n, but nothing works. When I call inspect(x) from the tm package, the output looks like this:

[346]                                                                                                                                                                                                                                                  
[347]    Thank you.                                                                                                                                                                                                                                    
[348]                                                                                                                                                                                                                                                  
[349]    Vice President of Investor Relations                                                                                                                                                                                               
[350]   

I would be grateful if anyone could help me!

2 Answers


Given your output, the empty lines appear to be separate character strings in a character vector. You can filter those out using grepl:

empty_lines = grepl('^\\s*$', x)
x = x[! empty_lines]

Then you can perform your subsequent analysis, but you probably still need to concatenate the lines first to get a single character string:

x = paste(x, collapse = '\n')
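As a minimal sketch of the two steps above, assuming `x` is a plain character vector (the sample values below are made up to mimic the `inspect()` output in the question, they are not the asker's data):

```r
# Hypothetical sample: blank and whitespace-only elements mixed with text
x <- c("", "   Thank you.", "   ", "   Vice President of Investor Relations", "")

# TRUE for elements that are empty or contain only whitespace
empty_lines <- grepl("^\\s*$", x)
x <- x[!empty_lines]

# Join the remaining lines into a single string, one line per element
joined <- paste(x, collapse = "\n")

print(x)
cat(joined, "\n")
```

Note that the leading spaces of the kept lines are still there after this step; only the blank elements are dropped.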
Konrad Rudolph
  • @Konrad Rudolph Thank you! I have tried it out but I get the following error message: `Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"` – Daria Diachenko Jun 13 '16 at 09:36
  • @Daria There is no call to “meta” in my code, so I don’t know where this error comes from. You’re either using different code, or your R session has redefined some core R constructs in very weird ways. – Konrad Rudolph Jun 13 '16 at 09:37
  • I figured it out with `x <- gsub("^\\s+|\\s+$", "", x)` command. Thank you for your help! – Daria Diachenko Jun 13 '16 at 10:01
  • Did you mean to use `grepl` instead of `grep`? (The use of `!` suggests that.) An alternative might be `trimws(x) == ""` – talat Jun 13 '16 at 10:25
  • @docendodiscimus I actually meant to use `-` instead of `!`, but it comes down to the same thing. I often use `grepl` and combine the result logically with other masks, hence the mistake. – Konrad Rudolph Jun 13 '16 at 10:37
  • I personally prefer `grepl` in most situations and would also use it here, since `grep` is problematic when there are no matches, as you know. – talat Jun 13 '16 at 10:43
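The point in the comments about `grep` with no matches can be illustrated with a short sketch. The input vector is a made-up example that deliberately contains no blank lines:

```r
# Hypothetical input with no empty lines at all
x <- c("Thank you.", "Vice President of Investor Relations")

# grep() returns integer(0) when nothing matches, and negating it
# still gives integer(0), so the subset selects nothing
idx <- grep("^\\s*$", x)
dropped_all <- x[-idx]

# A logical mask from grepl() keeps every element in the same situation
kept <- x[!grepl("^\\s*$", x)]

# The trimws() alternative mentioned above behaves the same way
kept_trimws <- x[trimws(x) != ""]

print(length(dropped_all))
print(kept)
```

With negative indexing, a vector without blank lines ends up empty, which is why the logical form is usually the safer choice here.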

You can match a capital letter at the start of a line with ^[A-Z], and combine it with your original pattern using the alternation operator |

x <- gsub("(\\..*?[A-Z]|^[A-Z])", '\\L\\1', x, perl=TRUE)

And you can get rid of empty lines either before or after the above step with

x <- x[x != ""]
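Here is a small sketch of both steps applied together, assuming the lines have no leading whitespace (the sample sentences are invented for illustration):

```r
# Hypothetical sample: two text lines separated by an empty string
x <- c("Thank you. We like R.", "", "No targets. See Q2 results.")

# Drop elements that are exactly the empty string
x <- x[x != ""]

# Lowercase a capital at the start of each element, and the first
# capital that follows a full stop; \\L requires perl = TRUE
x <- gsub("(\\..*?[A-Z]|^[A-Z])", "\\L\\1", x, perl = TRUE)

print(x)
```

Capitals that are neither at the start of an element nor the first capital after a full stop, such as `R` and `Q2` here, are left unchanged. Because `x != ""` only matches exactly empty strings, whitespace-only elements would survive this filter.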
JeremyS
  • Thank you! The latter one worked for me. However, I still have a problem, an example of which I will post in the next comment: there are "extra" spaces left at the beginning of the lines... – Daria Diachenko Jun 13 '16 at 09:53
  • `[283] web tool. [284] No, we're not providing any specific targets for the second quarter. [285] Thank you. ` – Daria Diachenko Jun 13 '16 at 09:53
  • You can do `gsub("^ ", "", x)` to get rid of spaces at the start of a line – JeremyS Jun 14 '16 at 09:28
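To tie the comments together, here is a hedged sketch that trims whitespace first and then applies the lowercase step. It uses the trimming pattern the asker reported in the other answer's comments, and the sample vector is invented:

```r
# Hypothetical sample with several leading and trailing spaces
x <- c("   Thank you.   ", "   Vice President of Investor Relations", "  ")

# "^ " removes only a single leading space per element
one_space <- gsub("^ ", "", "   Thank you.")

# Strip all leading and trailing whitespace instead
x <- gsub("^\\s+|\\s+$", "", x)

# Whitespace-only elements are now empty strings and can be dropped
x <- x[x != ""]

# With the padding gone, ^[A-Z] can match the first letter of each line
x <- gsub("(\\..*?[A-Z]|^[A-Z])", "\\L\\1", x, perl = TRUE)

print(one_space)
print(x)
```

Trimming before the lowercase step matters because `^[A-Z]` cannot match a line that still starts with a space.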