I have a very large file containing thousands of sentences. In every one of them, the first word begins with a lowercase letter, but I need it to begin with an uppercase letter.

I looked through the site trying to find a regex to do this but I was unable to. I learned a lot about regex in the process, which is always a plus for my job, but I was unable to find specifically what I am looking for.

I tried to find a way of compiling the code from several answers, including the following:

But for different reasons none of them served my purpose.

I am working with a translation-specific application which accepts regex.

Do you think this is possible at all? It would save me hours of tedious work.

glhr
CanoEE
  • Match `^[^a-z]*[a-z]` and replace with `\U$0`? (you'll need an editor that supports `\U`) – CertainPerformance Apr 17 '19 at 06:05
  • If your editor doesn't support it, it's quite easy to write a script (eg. in Python) to achieve what you want. – glhr Apr 17 '19 at 06:10
  • @CertainPerformance it looks like it does not support \U :/. Also, when using the ^[^a-z]*[a-z] regex I get every single lowercase character, not only the first lowercase character in the first word of each line. I don't know if my editor is somehow limited. – CanoEE Apr 17 '19 at 06:15
  • If your editor matches *all* lowercase characters with that pattern, your editor's regex engine is probably broken. The pattern I posted worked fine for me. It would greatly help if you posted an example input and expected output – CertainPerformance Apr 17 '19 at 06:17
  • Expected input: regex can make my job easier Expected output: Regex can make my job easier Thanks again! – CanoEE Apr 17 '19 at 06:19

2 Answers

You can use this regex to search for the first letters of sentences:

(?<=[\.!?]\s)([a-z])

It matches a lowercase letter [a-z] that follows the end of a previous sentence (one of the characters in [\.!?]) and a single whitespace character \s.

Then substitute with \U$1, which uppercases the captured letter.

The only case it misses is the very first sentence, because nothing precedes it. I intentionally kept the regex simple, since it's easy to capitalize the very first letter manually.
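
If the tool at hand has no \U, the same lookbehind can be applied from a short script. Here is a minimal sketch in Python, assuming the text is already loaded into a string. Python's re module has no \U replacement escape, so a callback does the uppercasing instead. The function name capitalize_after_sentence_end is made up for this illustration.

```python
import re

# A lowercase letter preceded by sentence-ending punctuation
# and exactly one whitespace character (fixed-width lookbehind).
SENTENCE_START = re.compile(r"(?<=[.!?]\s)([a-z])")

def capitalize_after_sentence_end(text: str) -> str:
    # re.sub accepts a function: it receives each match object
    # and returns the replacement string for that match.
    return SENTENCE_START.sub(lambda m: m.group(1).upper(), text)

print(capitalize_after_sentence_end("first one. second one! third one? done."))
# prints: first one. Second one! Third one? Done.
```

As with the regex above, the very first letter of the text stays untouched.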

Working example: https://regex101.com/r/hqwK26/1

UPD: If your software doesn't support \U, you can copy your text into Notepad++ and do the replacement there. I just checked, and \U is fully supported.

UPD2: According to the comments, the task is slightly different: only the first letter of each line should be capitalized.

There is a simple regex for that: ^([a-z]), with the same substitution pattern. If your engine treats ^ as the start of the whole text rather than of each line, enable multiline mode.

Here is a working example: https://regex101.com/r/hqwK26/2
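
For anyone who would rather script this than paste into another editor, here is a rough Python equivalent of the line-start pattern. It assumes one sentence per line, as in the asker's example. The name capitalize_line_starts is hypothetical, and re.MULTILINE is what makes ^ match at the start of every line.

```python
import re

# With re.MULTILINE, ^ matches at the beginning of each line,
# not only at the beginning of the whole string.
LINE_START = re.compile(r"^([a-z])", re.MULTILINE)

def capitalize_line_starts(text: str) -> str:
    # Uppercase the single captured letter; everything else is left alone.
    return LINE_START.sub(lambda m: m.group(1).upper(), text)

sample = "regex can make my job easier\nit saves hours."
print(capitalize_line_starts(sample))
# prints:
# Regex can make my job easier
# It saves hours.
```

Lines that already start with an uppercase letter, a digit or punctuation do not match and are returned unchanged.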

Ildar Akhmetov
  • Thanks Ildar, it looks like this is pretty close to what I need. But all sentences begin with the word lowercase and I need each first word to be uppercased, and not the ones after the end of a previous sentence. Do you think it's possible? Right, my software does not support \U. I would need to find an alternative solution. – CanoEE Apr 17 '19 at 06:32
  • @CanoEE Can you give an example? It's a bit unclear. – Ildar Akhmetov Apr 17 '19 at 07:09
  • Yes, it would be as follows: Expected input: regex can make my job easier Expected output: Regex can make my job easier It's just a list of sentences starting with lowercase and ending with period. I just need to change the first letter of the first word of each sentence (regex in the example) from lowercase into uppercase (Regex would be the expected output. Thanks again! – CanoEE Apr 17 '19 at 07:50
  • @CanoEE Please check the example here: https://regex101.com/r/hqwK26/2 . If it is what you need, I'll go ahead and update the answer. – Ildar Akhmetov Apr 17 '19 at 07:54
  • Awesome! It is exactly that, thanks a million! My software does not allow me the use of \u, I guess there is no other way to do it without copying to Notepad, am I right? Really thankful for your help!!! – CanoEE Apr 17 '19 at 08:22
  • @CanoEE Yes, you should just copy/paste to Notepad++ or something else. I updated the answer, please mark it as accepted :) – Ildar Akhmetov Apr 17 '19 at 08:27
  • `\U` is printed literally in my code. I am substituting `\U$1`, and my expression is `([\.!?])(\s*)([a-z])`. How can I resolve this? – Tanmay Bairagi Jul 21 '20 at 14:33

Combining both of the patterns from Ildar's answer should work with no compromises:

(?<=[\.!?]\s)([a-z])|^([a-z])

This says: match the first pattern OR the second pattern. Because the combined pattern now has two capturing groups instead of one, the letter matched by the second alternative lands in group 2, which you refer to as $2. Only one of the alternatives matches at any given position, so the other group contributes nothing, and the substitution pattern becomes:

\U$1$2

Here's a working example, again based on Ildar's answer... https://regex101.com/r/hqwK26/13
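
The same combined pattern can also be run from a script. This is a sketch in Python under the assumption that the text is held in a string; capitalize_sentences is an illustrative name. Since exactly one of the two groups participates in each match, the callback uppercases the whole match (a single letter) rather than juggling both groups.

```python
import re

# Either alternative matches exactly one lowercase letter:
# after sentence-ending punctuation plus whitespace, or at a line start.
PATTERN = re.compile(r"(?<=[.!?]\s)([a-z])|^([a-z])", re.MULTILINE)

def capitalize_sentences(text: str) -> str:
    # m.group(0) is the matched letter regardless of which alternative fired,
    # so there is no need to check which group is None.
    return PATTERN.sub(lambda m: m.group(0).upper(), text)

sample = "regex helps. it saves time.\nanother line"
print(capitalize_sentences(sample))
# prints:
# Regex helps. It saves time.
# Another line
```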

MShoukry
  • Welcome to Stack Overflow. Please consider posting your example in the space provided here instead of an external link. – sao Feb 15 '20 at 17:47