
I have been trying for about 2 hours, and I'm not sure whether what I want to do is even possible.

I have a large file with some data that looks like

43034452      LONGSHIRTPAIETTE                                        17.30
               27.90                                    
                                             0110             


          COLOR               :                    :                    :                    :                    :
                :                    :                    :                    
             -11     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
43034453      LONG SHIRT PAI ETTE                                              16.40
               25.90                                    
                                             0110             


          COLOR               :                    :                    :                    :                    :
                :                    :                    :                    
              -3     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
43034454      BASIC                                                     4.99
                8.90                                    
                                             0110             


          COLOR               :                    :                    :                    :                    :
                :                    :                    :                    
              -5     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0

(The file has 36k rows.)

What I want to do is clean this whole thing up.

In the end, the rows should look like

43034452;LONGSHIRTPAIETTE;17.30;27.90;0110
43034453;LONG SHIRT PAI ETTE;16.40;25.90;0110
43034454;BASIC;4.99;8.90;0110

So there is a lot of data that I don't need. I'm using Notepad++ to do my regex.

My regex string looks like ([0-9]*)\s{6,}([A-Z]*)\s*([0-9\.]*)\s*([0-9\.]*)\s*([0-9]*) at the moment.

This matches the first number followed by at least 6 whitespace characters. (It has to be like this because some rows start with FF, and FF is not a pair of letters here. It's some kind of character that I can't identify, but if I let Notepad++ show all characters I see FF.)

So as a result I get

\1: 43034452
\2: LONGSHIRTPAIETTE
\3: 17.30
\4: 27.90
\5: 0110

as expected, but on the next row it stops at the first space inside the name. If I add \s to the pattern, then it also selects all the spaces after the word part. And I obviously can't say "only one space", can I?
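The behaviour described above can be reproduced outside Notepad++. The following is a minimal sketch using Python's `re` module, which is an assumption on my part (the question uses Notepad++, whose engine can differ in details); the two records are shortened, hypothetical copies of the sample data. It shows that `[A-Z]*` ends at the first space, after which every remaining group is allowed to match nothing.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(
    r"([0-9]*)\s{6,}([A-Z]*)\s*([0-9\.]*)\s*([0-9\.]*)\s*([0-9]*)"
)

# Shortened records modelled on the sample data (padding reduced for readability).
one_word = (
    "43034452      LONGSHIRTPAIETTE        17.30\n"
    "      27.90      \n"
    "        0110   \n"
)
multi_word = (
    "43034453      LONG SHIRT PAI ETTE        16.40\n"
    "      25.90      \n"
    "        0110   \n"
)

first = ORIGINAL.match(one_word).groups()
second = ORIGINAL.match(multi_word).groups()

print(first)   # all five fields are captured
print(second)  # the name stops at "LONG"; the last three groups are empty
```

Because every group uses `*`, the match still succeeds on the second record; it just captures empty strings instead of failing.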

So my question is, can I use regex to get a selection like the one I want?

If so, what am I doing wrong?

TRiG
Dwza

3 Answers

1

Use the below regex

([0-9]*)\s{6,}([A-Z]+(?:\s+[A-Z]+)*)\s*([0-9\.]*)\s*([0-9\.]*)\s*([0-9]*).*?(?=\n\S|$)

and then replace the match with \1;\2;\3;\4;\5

Don't forget to enable the DOTALL modifier s (in Notepad++, the ". matches newline" checkbox).
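As a rough illustration of this answer, here is a sketch in Python's `re` module (an assumption; the answer targets Notepad++ and a regex demo site). The `SAMPLE` text is a shortened, hypothetical version of the data in the question. In Python, `$` without `re.MULTILINE` matches only at the end of the text, so the trailing `.*?(?=\n\S|$)` swallows the COLOR and counter lines up to the next record. Notepad++'s engine may treat `$` and line endings such as `\r\n` differently, so results there can differ.

```python
import re

# The regex from the answer, with DOTALL so that "." also matches newlines.
AVINASH = re.compile(
    r"([0-9]*)\s{6,}([A-Z]+(?:\s+[A-Z]+)*)\s*([0-9\.]*)\s*([0-9\.]*)\s*([0-9]*)"
    r".*?(?=\n\S|$)",
    re.DOTALL,
)

# Shortened, hypothetical sample: two records, each followed by junk lines.
SAMPLE = (
    "43034452      LONGSHIRTPAIETTE        17.30\n"
    "               27.90      \n"
    "                     0110      \n"
    "\n"
    "          COLOR     :     :\n"
    "             -11     0     0     0\n"
    "43034453      LONG SHIRT PAI ETTE          16.40\n"
    "               25.90      \n"
    "                     0110      \n"
    "\n"
    "          COLOR     :     :\n"
    "              -3     0     0     0\n"
)

# Replace each whole record (including its junk lines) with the five fields.
cleaned = AVINASH.sub(r"\1;\2;\3;\4;\5", SAMPLE)
rows = cleaned.splitlines()

for row in rows:
    print(row)
```

The lookahead `(?=\n\S|$)` is what ends a record: it stops just before a newline that is followed by a non-whitespace character, which in this data is the start of the next article number.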

DEMO

Avinash Raj
  • This pattern also matches the row with COLOR. Actually it's nice because the spaces after the last word are cut off, but there are too many rows selected. – Dwza Apr 08 '15 at 14:15
  • but it gives you the expected output. – Avinash Raj Apr 08 '15 at 14:19
  • Since there are more rows selected than I want... then no :) Because if I replace the data with, let's say, a semicolon, then I change data "that I don't need", and that makes it another task to select this and remove it :) So it's a yes and a no. That's why I gave you a +1. – Dwza Apr 08 '15 at 14:25
1

Try this:

([0-9]+)\s{6,}((?:[A-Z]+\ )+)\s*([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)

Note a few things:

  • I tightened the *s to +s where appropriate, so each column must contain at least one character and the columns must be separated by actual whitespace.
  • A non-capturing group repeats one or more instances of a word followed by a single space.
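The two points above can be checked with a short sketch in Python's `re` module (an assumption; the answer itself is meant for Notepad++). `SAMPLE` is a shortened, hypothetical version of the question's data. Note that the name group ends with exactly one trailing space, because each word in the repeated group is followed by `\ `, so the sketch strips each field before joining.

```python
import re

# The regex from the answer; "\ " is an escaped literal space.
DECLENSION = re.compile(
    r"([0-9]+)\s{6,}((?:[A-Z]+\ )+)\s*([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)"
)

# Shortened, hypothetical sample: two records, each followed by junk lines.
SAMPLE = (
    "43034452      LONGSHIRTPAIETTE        17.30\n"
    "               27.90      \n"
    "                     0110      \n"
    "\n"
    "          COLOR     :     :\n"
    "             -11     0     0     0\n"
    "43034453      LONG SHIRT PAI ETTE          16.40\n"
    "               25.90      \n"
    "                     0110      \n"
    "\n"
    "          COLOR     :     :\n"
    "              -3     0     0     0\n"
)

# findall returns one tuple of the five captured groups per record.
matches = DECLENSION.findall(SAMPLE)

# The name group keeps a single trailing space from the last "word + space".
raw_names = [m[1] for m in matches]

# Strip each field and join with ";" to get the target row format.
rows = [";".join(field.strip() for field in m) for m in matches]

for row in rows:
    print(row)
```

Since every column now needs at least one character, the COLOR and counter lines no longer produce matches of their own; they are simply skipped rather than replaced.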
Community
declension
  • Guess I will go with this one, because in this case I have only one space at the end of the last word. :) – Dwza Apr 08 '15 at 14:18
  • Is there a possibility to also look for a comma in the word? Because I have a word like `foo,,bar`, and the commas can stay as they are... – Dwza Apr 08 '15 at 14:50
  • @Dwza - sure, just add it to the characters, i.e. use `[A-Z,]+` – declension Apr 08 '15 at 14:53
  • This is what I was trying, but no matter what I do, it doesn't match these rows... :( See [this sample](https://regex101.com/r/vU7tB9/2) – Dwza Apr 08 '15 at 14:59
  • Aaarrrg... that's not a comma ^^ it's another character. Sorry ;D – Dwza Apr 08 '15 at 15:05
1

Your approach is correct. Just replace * with + (one or more) in your regex.

/([0-9]+)\s{6,}([A-Z ]+)\s+([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)/g

See the DEMO.
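To see what this pattern captures for the name, here is a small sketch in Python's `re` module (an assumption; the answer targets Notepad++ and a demo site). The record is a shortened, hypothetical copy of the sample data. Because a space is part of the character class, the greedy `[A-Z ]+` also takes the padding after the name, leaving only one space for the following `\s+`. The lazy variant `[A-Z ]+?` is my own variation, not part of the answer.

```python
import re

NAME = "LONG SHIRT PAI ETTE"

# Hypothetical record: 6 spaces after the number, 10 spaces after the name.
record = (
    "43034453" + " " * 6 + NAME + " " * 10 + "16.40\n"
    "      25.90      \n"
    "        0110   \n"
)

# The regex from the answer (without the /.../g delimiters).
GREEDY = re.compile(
    r"([0-9]+)\s{6,}([A-Z ]+)\s+([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)"
)

# Variation (not from the answer): a lazy quantifier on the name group.
LAZY = re.compile(
    r"([0-9]+)\s{6,}([A-Z ]+?)\s+([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)"
)

greedy_name = GREEDY.match(record).group(2)
lazy_name = LAZY.match(record).group(2)

print(repr(greedy_name))  # name plus most of the padding
print(repr(lazy_name))    # name only
```

With the lazy quantifier, the name group grows only until the following `\s+` and the price can match, so the padding ends up outside the capture.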

karthik manchala
  • This seems to work online but not in Notepad++. If I use this one it skips a lot of rows, then suddenly it takes 2 rows and skips some more rows... – Dwza Apr 08 '15 at 14:12
  • Please check the updated regex. I missed the space in `([A-Z ]+)` – karthik manchala Apr 08 '15 at 14:14
  • Actually OK, but the spaces at the end of the last word aren't cut off. – Dwza Apr 08 '15 at 14:17
  • Can you check this? `([0-9]+)\s{6,}([A-Z ]+)\s+([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)(?=\s)` This should give you zero spaces after the last word. – karthik manchala Apr 08 '15 at 14:28
  • Same result :) But thanks, I have accepted nick's answer. That helped me most. Thank you for your tries :) – Dwza Apr 08 '15 at 14:30