
I'm a regex beginner and need your help finding the right regex for my project in Notepad++. My aim is to find and extract some strings in single quotes that were taken from an HTML document. I need a single regex to do it all, and I am bound to use Notepad++.

Here's the structure of my text document (cannot use the original since it contains confidential material):

{ group: '1', code: '1111', ignored: true, shortDescription: 'This is a short "description", containing commas or quotes', description: '', document: 'documentname.txt', row: '1', original: 'this is the original text', translated: 'this is the translated text', matchRate: {label: "label", value: "value"} } _LF_
{ group: '2', code: '2222', ignored: true, shortDescription: 'This is another short "description", containing commas or quotes', description: '', document: 'documentname.txt', row: '1', original: 'this is the original text', translated: 'this is the translated text', matchRate: {label: "label", value: "value"} } _LF_
{ group: '3', code: '3333', ignored: true, shortDescription: 'This is yet another short "description", containing commas or quotes', description: '', document: 'documentname.txt', row: '1', original: 'this is the original text', translated: 'this is the translated text', matchRate: {label: "label", value: "value"} }

My document contains 33 rows, all looking like this ("_LF_" at the end is a line break). The keys ("group", "code" and so on) are always the same; the strings in single quotes differ and may also be empty.

I need to extract all values in single quotes (or delete everything else), separated by a comma (or similar), in order to put them into an Excel document. I need to keep the line breaks, too.

Here's what I already did: I am able to find all strings in single quotes with

([^']*+'[^\r\n']*+)

However, this way the text between a closing single quote and the next opening single quote is also included in the matches.
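To illustrate what that pattern picks up, here is a rough Python sketch. It is only an approximation of Notepad++'s behaviour: the possessive `*+` quantifiers are replaced by plain greedy ones (an assumption that they match the same text on this input), and the helper name `show_matches` and the shortened sample line are made up for the demonstration.

```python
import re

# The pattern from above, with greedy quantifiers instead of possessive ones
# (assumption: both forms match the same text for this kind of input).
PATTERN = re.compile(r"([^']*'[^\r\n']*)")

def show_matches(line):
    """Return every piece of text the pattern matches in `line`."""
    return PATTERN.findall(line)

# Hypothetical, shortened sample row.
for piece in show_matches("{ group: '1', code: '1111' }"):
    print(repr(piece))
```

Each match runs from one single quote up to the next, so the pattern does not pair opening and closing quotes: the text between two values is matched in the same way as a value itself.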

What I still need is a way to erase all the other text, including the single quotes around these strings. I wasn't able to manage that. Here is what the result should look like:

'1', '1111', 'This is a short "description", containing commas or quotes' '', 'documentname.txt', '1', 'this is the original text', 'this is the translated text'
'2', '2222', 'This is another short "description", containing commas or quotes' '', 'documentname.txt', '1', 'this is the original text', 'this is the translated text'
'3', '3333', 'This is yet another short "description", containing commas or quotes' '', 'documentname.txt', '1', 'this is the original text', 'this is the translated text'

I also read some threads on regex like this or this, and I learned a lot (as I said, beginner speaking here...), but I didn't manage to find a solution to extract exactly the strings I need.

I would be very happy if someone could help me. Thanks a lot!

ladyfrauke
  • Try `[^\n']*'([^\n']+)'[^\n']*` and replace with `\1\t` – Wiktor Stribiżew Oct 31 '16 at 13:46
  • There are two aspects to the problem (1) finding the wanted items and (2) the exact output format you want. If you show a two or three line example of the input plus the expected output from that input then your question might be answered. – AdrianHHH Oct 31 '16 at 20:58
  • @AdrianHHH Thanks, I updated my question accordingly. @Wiktor Stribiżew: This didn't do the trick, but I think my question was too vague. I hope it's clearer now. – ladyfrauke Nov 01 '16 at 12:18

2 Answers

0

You could possibly do it via 2 steps:

1.

Find : .*?(?:\s'([^']+)'|(_LF_)).*?

Replace : $1$2,

2.

Find : ,_LF_,

Replace : \r\n

That will leave you with :

1, 1111, This is a short "description", containing commas or quotes, documentname.txt, 1, this is the original text, this is the translated text

2, 2222, This is another short "description", containing commas or quotes, documentname.txt, 1, this is the original text, this is the translated text

3, 3333, This is yet another short "description", containing commas or quotes, documentname.txt, 1, this is the original text, this is the translated text, , matchRate: {label: "label", value: "value"} }

You'll then just need to trim the trailing `, , matchRate: {label: "label", value: "value"} }` from the last line.

This will only work if there is always _LF_ at the end of each line by the way.
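For anyone who wants to try the two steps outside Notepad++, here is a minimal Python sketch. It makes several assumptions: `_LF_` is a literal marker in the text, a plain `\n` is used instead of `\r\n`, and Python's `re` behaves like Notepad++'s engine for this pattern. The function name `two_step_extract` and the shortened sample are hypothetical.

```python
import re

# Step 1 pattern from the answer: capture a quoted value or the _LF_ marker.
STEP1 = re.compile(r".*?(?:\s'([^']+)'|(_LF_)).*?")

def two_step_extract(text):
    """Apply both find/replace steps to `text` and return the result."""
    # Step 1: keep only the captured value (or marker), followed by a comma.
    step1 = STEP1.sub(r"\1\2,", text)
    # Step 2: turn the marker and its surrounding commas into a line break.
    return step1.replace(",_LF_,", "\n")

# Hypothetical, shortened sample with a literal _LF_ marker after each record.
sample = (
    "{ group: '1', code: '1111', description: '', note: 'a, b', m: {v: \"x\"} } _LF_"
    "{ group: '2', code: '2222', description: '', note: 'c', m: {v: \"x\"} } _LF_"
)
print(two_step_extract(sample))
```

Note that `[^']+` requires at least one character, so an empty value such as `description: ''` is skipped rather than kept as an empty field.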

Neal
  • Thanks a lot. But when I apply the regex, all commas seem to be gone which leaves me with just the values. Still one step closer than I used to be... :) – ladyfrauke Nov 22 '16 at 14:33
  • Sorry I had an unnecessary space in the second find. The edited code should work now. Not sure what's happening with your commas though. They don't get removed when I try it. Make sure your first replace is definitely `$1$2,` with the comma at the end. – Neal Nov 22 '16 at 14:54
  • I think it would work now, but as _LF_ is actually a line break, that seems to be a problem. I think I will be able to figure this out myself in a quiet minute. Anyway, as I have one solution now, I want to thank you for your advice :) – ladyfrauke Nov 24 '16 at 12:41
0

Use Notepad++'s regex Find and Replace. Make sure "Regular expression" mode is selected and ". matches newline" is unticked.

Edited: a comma inside an item is no longer kept (only a single comma per item is allowed).

find [^'\r\n]*(?:'([^'\r\n,]*),?([^'\r\n,]*)'|([\r\n]+))(,(?=.*'))?

replace with \1\2\3\4

It should produce the output below:

1,1111,This is a short "description" containing commas or quotes,,documentname.txt,1,this is the original text,this is the translated text
2,2222,This is another short "description" containing commas or quotes,,documentname.txt,1,this is the original text,this is the translated text
3,3333,This is yet another short "description" containing commas or quotes,,documentname.txt,1,this is the original text,this is the translated text

This only works assuming there is always a newline at the end of each line, and that it is an actual \r\n rather than the literal _LF_.
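The same find/replace can be tried in Python as a rough sketch. It assumes that Python's `re` treats this pattern like Notepad++ does; by default `.` does not match `\n` in Python, which corresponds to unticking ". matches newline". The function name `extract_values` and the shortened sample rows are hypothetical.

```python
import re

# The find pattern from this answer, unchanged.
PATTERN = re.compile(
    r"[^'\r\n]*(?:'([^'\r\n,]*),?([^'\r\n,]*)'|([\r\n]+))(,(?=.*'))?"
)

def extract_values(text):
    """Keep quoted values, separating commas, and line breaks; drop the rest."""
    # Groups 1+2: the value without its (single) inner comma.
    # Group 3: the line break. Group 4: a comma, if another value follows.
    return PATTERN.sub(r"\1\2\3\4", text)

# Hypothetical, shortened sample; every row ends with a real newline.
sample = (
    "{ group: '1', code: '1111', description: '', note: 'a, b', m: {v: \"x\"} }\n"
    "{ group: '2', code: '2222', description: '', note: 'c', m: {v: \"x\"} }\n"
)
print(extract_values(sample))
```

Unlike a pattern that demands at least one character between the quotes, this one keeps the empty `''` value as an empty field, and the inner comma of `'a, b'` is dropped so that it cannot be mistaken for a separator.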

Skycc
  • Thank you very much, that almost did the trick. The only thing left is: sometimes there are commas within one value. So in my final version I don't need this: 1, 1111,This is a short "description", containing commas or quotes,, ..... but this: '1','1111','This is a short "description", containing commas or quotes','', ..... so that a comma that was within a value is not treated as a separator between values. Do you know how to achieve this? – ladyfrauke Nov 22 '16 at 14:32
  • If I understand correctly, you want to discard the comma in `This is a short "description" containing commas or quotes`. Try the edited answer: it discards the comma, as the output shows, but it only allows a single comma per item; with more than one it might break. – Skycc Nov 22 '16 at 15:22
  • You can accept and/or upvote the answer if it solves your problem :) – Skycc Nov 24 '16 at 12:55
  • Accepted! Thanks for the hint, I nearly forgot. As for the upvoting: I have too little reputation, so my upvote will not be visible (but I did upvote!). – ladyfrauke Nov 24 '16 at 14:43
  • Thanks. When an answer is accepted, both of our reputations increase :) – Skycc Nov 24 '16 at 14:45