Replace space with semicolon when more than one with regex

Question

I have been trying for about 2 hours, and I'm not sure whether what I want to do is even possible.

I have a large file with some data that looks like

43034452      LONGSHIRTPAIETTE                                        17.30
               27.90                                    
                                             0110             


          COLOR               :                    :                    :                    :                    :
                :                    :                    :                    
             -11     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
43034453      LONG SHIRT PAI ETTE                                              16.40
               25.90                                    
                                             0110             


          COLOR               :                    :                    :                    :                    :
                :                    :                    :                    
              -3     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
43034454      BASIC                                                     4.99
                8.90                                    
                                             0110             


          COLOR               :                    :                    :                    :                    :
                :                    :                    :                    
              -5     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0

(The file has 36k rows.)

What I want to do is to get this whole thing clean.

In the end, the rows should look like

43034452;LONGSHIRTPAIETTE;17.30;27.90;0110
43034453;LONG SHIRT PAI ETTE;16.40;25.90;0110
43034454;BASIC;4.99;8.90;0110

So there is a lot of data that I don't need. I'm using Notepad++ to do my regex.

My regex string looks like ([0-9]*)\s{6,}([A-Z]*)\s*([0-9\.]*)\s*([0-9\.]*)\s*([0-9]*) at the moment.

This matches the first number followed by at least 6 spaces. (It has to be like this because some rows start with FF, and that FF is not two letters. It's some kind of character that I can't identify, but if I let Notepad++ show all characters I see FF.)

So as a result I get

\1: 43034452
\2: LONGSHIRTPAIETTE
\3: 17.30
\4: 27.90
\5: 0110

as expected, but on the next row it stops at the first space inside the name. If I add \s to the pattern, it also selects all the spaces after the name. And I obviously can't say "only one space", can I?

So my question is, can I use regex to get a selection like the one I want?

If so, what am I doing wrong?

score 1 · Answer 1 · answered Apr 08 '15 at 14:04


Use the below regex

([0-9]*)\s{6,}([A-Z]+(?:\s+[A-Z]+)*)\s*([0-9\.]*)\s*([0-9\.]*)\s*([0-9]*).*?(?=\n\S|$)

and then replace the match with \1;\2;\3;\4;\5

Don't forget to enable the DOTALL modifier s (in Notepad++, the ". matches newline" checkbox).
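A minimal Python sketch of this replace step, for checking the pattern outside Notepad++. The `SAMPLE` text and the `clean` helper are illustrative names, not part of the answer, and the sample is a shortened excerpt in the shape of the question's data. One assumption to keep in mind: without `re.MULTILINE`, Python's `$` matches only at the end of the string, so the lazy `.*?` runs up to the next line that starts with a non-space character. An editor that treats `$` as end-of-line may stop earlier and give different results.

```python
import re

# Pattern from the answer, unchanged. re.DOTALL lets "." cross line breaks.
PATTERN = re.compile(
    r"([0-9]*)\s{6,}([A-Z]+(?:\s+[A-Z]+)*)\s*([0-9\.]*)\s*([0-9\.]*)\s*([0-9]*)"
    r".*?(?=\n\S|$)",
    re.DOTALL,
)

# Shortened, illustrative excerpt (not the real 36k-row file).
SAMPLE = (
    "43034452      LONGSHIRTPAIETTE          17.30\n"
    "               27.90          \n"
    "                    0110             \n"
    "\n"
    "          COLOR               :                    :\n"
    "              -11     0     0     0\n"
    "43034453      LONG SHIRT PAI ETTE        16.40\n"
    "               25.90          \n"
    "                    0110             \n"
    "\n"
    "          COLOR               :                    :\n"
    "               -3     0     0     0\n"
)


def clean(text):
    """Replace each record, including its COLOR and counter lines, with five fields."""
    return PATTERN.sub(r"\1;\2;\3;\4;\5", text)


print(clean(SAMPLE))
```

In this sketch each match swallows everything up to the next record, so the unwanted lines disappear as part of the replacement.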


answered Apr 08 '15 at 14:04 by Avinash Raj

This pattern also matches the row with COLOR. Actually it's nice because the spaces after the last word are cut off, but too many rows are selected – Dwza Apr 08 '15 at 14:15
But it gives you the expected output. – Avinash Raj Apr 08 '15 at 14:19
Since more rows are selected than I want... then no :) Because if I replace the data with, let's say, a semicolon, then I change data "that I don't need", and that makes it another task to select and remove it :) So it's a yes and a no. That's why I gave you a +1. – Dwza Apr 08 '15 at 14:25

score 1 · Accepted Answer · edited May 23 '17 at 10:24


Try this:

([0-9]+)\s{6,}((?:[A-Z]+\ )+)\s*([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)

Note a few things:

The *s are tightened to + where appropriate, so each column must contain at least one character and each gap must contain actual whitespace.
A non-capturing group repeats one or more instances of a word followed by a space.
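A minimal Python sketch of how this pattern could be applied to the whole file in one pass. The `SAMPLE` text and the `extract_rows` helper are illustrative names, not part of the answer. Because the name group captures each word together with the space after it, group 2 ends with one trailing space, which the sketch strips before joining the fields.

```python
import re

# Pattern from the answer, unchanged. \s also matches line breaks,
# so a record that wraps over several lines is still one match.
PATTERN = re.compile(
    r"([0-9]+)\s{6,}((?:[A-Z]+\ )+)\s*([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)"
)

# Shortened, illustrative excerpt (not the real 36k-row file).
SAMPLE = (
    "43034452      LONGSHIRTPAIETTE                    17.30\n"
    "               27.90          \n"
    "                              0110     \n"
    "          COLOR               :                    :\n"
    "             -11     0     0     0\n"
    "43034453      LONG SHIRT PAI ETTE                  16.40\n"
    "               25.90          \n"
    "                              0110     \n"
)


def extract_rows(text):
    """Return one 'id;name;price;price;code' string per matched record."""
    rows = []
    for match in PATTERN.finditer(text):
        # strip() removes the single trailing space left on the name group.
        fields = [group.strip() for group in match.groups()]
        rows.append(";".join(fields))
    return rows


for row in extract_rows(SAMPLE):
    print(row)
```

Collecting only the matches, rather than replacing in place, leaves the COLOR and counter lines out of the result without a second clean-up step.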

answered Apr 08 '15 at 14:05 by declension
edited May 23 '17 at 10:24 by Community

Guess I will go with this one, because in this case I have only one space at the end of the last word. :) – Dwza Apr 08 '15 at 14:18
Is there a possibility to also look for a comma in the word? Because I have a word like `foo,,bar`, and the commas can stay as they are... – Dwza Apr 08 '15 at 14:50
@Dwza - sure, just add it to the character class, i.e. use `[A-Z,]+` – declension Apr 08 '15 at 14:53
This is what I was trying, but no matter what I do, it doesn't match these rows... :( See [this sample](https://regex101.com/r/vU7tB9/2) – Dwza Apr 08 '15 at 14:59
Aaarrrg... that's not a comma ^^ It's another character. Sorry ;D – Dwza Apr 08 '15 at 15:05

karthik manchala · Answer 3 · 2015-04-08T14:13:41.500


Your approach is correct. Just replace * with + (one or more) in your regex.

/([0-9]+)\s{6,}([A-Z ]+)\s+([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)/g
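A minimal Python sketch of what the name group captures with this pattern. The `RECORD` string and the `VARIANT` pattern are assumptions added for illustration and are not part of the answer. Since `[A-Z ]+` accepts spaces, it greedily takes the padding after the name and gives back only the one space that the following `\s+` needs. The variant allows a single space only between two words, so the padding stays outside the group.

```python
import re

# Pattern from the answer, without the /.../g delimiters.
ORIGINAL = re.compile(
    r"([0-9]+)\s{6,}([A-Z ]+)\s+([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)"
)

# Hypothetical variant: words separated by exactly one space.
VARIANT = re.compile(
    r"([0-9]+)\s{6,}([A-Z]+(?: [A-Z]+)*)\s+([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)"
)

# One illustrative record with 10 spaces of padding after the name.
RECORD = (
    "43034453" + " " * 6 + "LONG SHIRT PAI ETTE" + " " * 10 + "16.40\n"
    + " " * 15 + "25.90\n"
    + " " * 20 + "0110"
)

original_name = ORIGINAL.search(RECORD).group(2)
variant_name = VARIANT.search(RECORD).group(2)

print(repr(original_name))  # name plus 9 of the 10 padding spaces
print(repr(variant_name))   # name only
```

If the original pattern is kept, trimming the captured name afterwards (for example with `rstrip()`) gives the same result as the variant.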


answered Apr 08 '15 at 14:08 by karthik manchala
edited Apr 08 '15 at 14:13

This seems to work online but not in Notepad++. If I use this one, it skips a lot of rows, then suddenly it takes 2 rows and skips some more rows... – Dwza Apr 08 '15 at 14:12
Please check the updated regex.. I missed the space in `([A-Z ]+)` – karthik manchala Apr 08 '15 at 14:14
Actually OK, but the spaces at the end of the last word aren't cut off. – Dwza Apr 08 '15 at 14:17
Can you check this? `([0-9]+)\s{6,}([A-Z ]+)\s+([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)(?=\s)` This should give you zero spaces after the last word. – karthik manchala Apr 08 '15 at 14:28
Same result :) But thanks, I have accepted Nick's answer. That helped me most. Thank you for your tries :) – Dwza Apr 08 '15 at 14:30
