I've been looking for this one all day now, this is the closest useful ref I found.

My problem: huge files come from a closed system (they can't be altered at the source) and need to be imported into ours. These files are |-separated and have a CRLF at the end of each line (except the last one). Now they found it funny to include a new type that can contain text with CR and CRLF in it (instead of <br>).

So what I need to do before I can process this file in our system is to replace all CRLF and CR occurrences that are not preceded by a | with <br>, so that every line starts with a code like 000| ... 600|

Closest I've got in Notepad++: Find: (?<![\|])[\r\n]+$

Replace: <br>

The problem is that it does not give a <br> for every CRLF; it misses a CRLF that follows a CR... Other attempts, which select the |CRLF too, miss the CR altogether.

Any thoughts greatly appreciated. Do keep in mind that the file can be over 500MB (complicating things a bit)

Extract of the file:

000|709076|153943|11||1|CRLF 
300|709076|153943|11|4|20000729||Majo509|CRLF 
500|709076|153943|11|6|3-3BNME|20000729|||21.13|4||20120509|CRLF 
600|709076|153943|11||SBV|7103||||20120509|CRLF 
600|709076|153943|11||SBV|7105||||20120509|CRLF 
600|709076|153943|11||SBV|7607||||20120509|CRLF 
600|709076|153943|11||MC||EVALUATIEROOSTER NIET INGEVULD :CR
CRLF 
------------------------------CR
CRLF 
CRLF 
Gezien U het evaluatierooster niet heeft ingevuld, blijft CR
CRLF 
CRLF 
|||20120509|CRLF 
600|709076|153943|11||SBV|7517||||20120509|CRLF 
000|709209|154072|9||1|Dne|LA1349|3100||L|20120509|CRLF 
300|709209|154072|9|3|20HEM-AT20120509|CRLF 
500|709209|154072|9|6|3-3BNME|20000908|||15.4|3||20120509|CRLF 
600|709209|154072|9||SBV|7103||||20120509|CRLF 
600|709209|154072|9||MC||AFSCHAFFING VAN DE EVOOR HET CR
CRLF 
(DE) GEBOUW(EN) CR
CRLF 
CR
CRLF 
indien U huurder of gebruiker bent.|||20120509|CRLF 
600|709209|154072|9||MC||DIEFSTAL  CRLF 

...

Required result: (rough copy paste job ;))

000|709076|153943|11||1|CRLF 
300|709076|153943|11|4|20000729||Majo509|CRLF 
500|709076|153943|11|6|3-3BNME|20000729|||21.13|4||20120509|CRLF 
600|709076|153943|11||SBV|7103||||20120509|CRLF 
600|709076|153943|11||SBV|7105||||20120509|CRLF 
600|709076|153943|11||SBV|7607||||20120509|CRLF 
600|709076|153943|11||MC||EVALUATIEROOSTER NIET INGEVULD :<BR><BR>---------------------<BR><BR><BR>Gezien U het evaluatierooster niet heeft ingevuld, blijft <BR><BR>||20120509|CRLF 
600|709076|153943|11||SBV|7517||||20120509|CRLF 
000|709209|154072|9||1|Dne|LA1349|3100||L|20120509|CRLF 
300|709209|154072|9|3|20HEM-AT20120509|CRLF 
500|709209|154072|9|6|3-3BNME|20000908|||15.4|3||20120509|CRLF 
600|709209|154072|9||SBV|7103||||20120509|CRLF 
600|709209|154072|9||MC||AFSCHAFFING VAN DE EVOOR HET <BR><BR>(DE) GEBOUW(EN) <BR><BR><BR><BR>indien U huurder of gebruiker bent.|||20120509|CRLF 
600|709209|154072|9||MC||DIEFSTAL  CRLF 
laar rommel
  • Note to self and googlers; For simpler replacement requirements, see Notepad++ / Edit / EOL Conversion / and then pick the desired format, eg unix/mac or windows or old-mac. – AnneTheAgile Oct 23 '15 at 14:43

2 Answers

Wow, this one fazed me for a little while...
It's tricky to do it in one pass.

The N++ constraint probably makes it tougher than it needs to be, but short of writing some code to do what you want it's a good way to go I guess.

While I'm not sure it's optimal, I had success with this combo.
Find:

([^|])\r([\r\n])*

Replace:

$1<br>

You need the $1 in the replace or you lose a character from your replaced lines - probably not what you want!
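To see what that find/replace pair does, here is a minimal sketch that mirrors it in Python's re module rather than in Notepad++ itself (Python writes the backreference as \1 instead of $1). The sample string is made up, loosely modelled on the extract in the question, and the function name is just for illustration. It shows that each run of CR/CRLF after a non-pipe character collapses to a single <br>, while |CRLF record endings are left alone.

```python
import re

# Same idea as the Notepad++ pattern: one non-pipe character, a CR,
# then any further CR/LF characters. The captured character is put back.
PATTERN = re.compile(r"([^|])\r[\r\n]*")

def collapse_breaks(text):
    """Replace each run of embedded line breaks with a single <br>."""
    return PATTERN.sub(r"\1<br>", text)

# Made-up sample: one record with embedded CR/CRLF text, then a normal record.
sample = (
    "600|1||MC||TEXT :\r\r\n"
    "---\r\r\n"
    "\r\n"
    "more\r\r\n"
    "\r\n"
    "|||2012|\r\n"
    "600|2|\r\n"
)

result = collapse_breaks(sample)
print(repr(result))
# '600|1||MC||TEXT :<br>---<br>more<br>|||2012|\r\n600|2|\r\n'
```

Note that the pattern requires a CR, so an embedded break that starts with CRLF directly after text is only caught when it follows a CR-led run; in the extract above every embedded block starts with CR, so that is enough there.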

Ideally, you should look into some Perl (I'm no Perl advocate, other scripting languages that handle regex are available...) or something similar to do this.

Edit: Just a thought. This makes the assumption that there won't be sections of your file that contain |CRLF or |CR or |CRCR that are not 'real' line endings.
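If a scripting language is available after all, a streaming approach avoids loading a 500MB file in one go and can emit one <br> per CR and per CRLF, as in the required result in the question. The following is only a sketch in Python under the same assumption as above (a | directly before CRLF always marks a real record end); the function names and file paths are hypothetical.

```python
import io
import re

# A CR that is not directly preceded by a pipe is treated as embedded text.
LONE_CR = re.compile(rb"(?<!\|)\r")

def convert(src, dst):
    """Copy binary stream src to dst, turning embedded breaks into <br>."""
    for line in src:  # binary streams are split on b"\n" only
        if line.endswith(b"\r\n"):
            body, has_eol = line[:-2], True
        elif line.endswith(b"\n"):
            body, has_eol = line[:-1], True
        else:
            body, has_eol = line, False  # last line without a line ending
        body = LONE_CR.sub(b"<br>", body)
        dst.write(body)
        if has_eol:
            # Keep the line ending only when the line closes with a pipe.
            dst.write(b"\r\n" if body.endswith(b"|") else b"<br>")

def convert_file(src_path, dst_path):
    """Convenience wrapper for files on disk (paths are placeholders)."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        convert(src, dst)

# Made-up sample, loosely modelled on the extract in the question.
sample = (
    b"000|1|\r\n"
    b"600|1||MC||TEXT :\r\r\n"
    b"---\r\r\n"
    b"\r\n"
    b"more \r\r\n"
    b"\r\n"
    b"|||2012|\r\n"
    b"600|2|\r\n"
)
out = io.BytesIO()
convert(io.BytesIO(sample), out)
print(out.getvalue())
```

For a real file this would be called as convert_file("input.txt", "output.txt"). One caveat of this sketch: an empty line directly after a real record ending is also turned into <br>, since it does not end with a pipe.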

BunjiquoBianco
  • Thanks :) This gives me a workable file; I didn't know about the $1 in the replace. A slightly better result would be a <br>
    for every CR and CRLF replaced instead of one per block, but I can get away with this :) Thanks again
    – laar rommel May 30 '12 at 05:59
Edit: Scrapped my last suggestions - didn't work

As suggested by BunjiquoBianco, I think that this is not possible in one pass.

Would be much better if you could use awk. If you are using Windows, try http://gnuwin32.sourceforge.net/packages/gawk.htm

If awk is a viable option, re-ask the question and the awk nuts will probably suggest a one-liner from command prompt to parse the whole file.

awk is fast too - would give you a much faster transformation and can be included in other scripts more easily thereby cutting out any manual N++ process.

Tiksi
  • Extra tools are not really an option; it's a very closed environment here. Thanks for the effort, and gawk does look useful. – laar rommel May 30 '12 at 05:57
  • Laar said "extra tools are not really an option" So if you are not prepared to use the right tool for the job, then why ask the question in the first place? I too would have thought this issue is easily solved using awk. – jussij May 30 '12 at 14:57