
I have to format 50k lines of chat logs.

The source file is pure text and looks something like this:

13. Mär. 01:32 - Walter:  
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.

 13. Mär. 06:15 - Horst:  
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, 
sed diam nonumy eirmod tempor invidunt ut labore et 
dolore magna aliquyam erat, sed diam voluptua.
magna aliquyam erat, sed diam voluptua.

There are only two persons in the whole chat - Walter and Horst. I need two regular expressions, one that selects all chat text from Walter and one that selects all chat text from Horst.

The regular expression for Walter should select this text from the example:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.

The regular expression for Horst should select this text from the example:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, 
sed diam nonumy eirmod tempor invidunt ut labore et 
dolore magna aliquyam erat, sed diam voluptua.
magna aliquyam erat, sed diam voluptua.

It's important to me to only select the text lines and not the date / time / person line.

UPDATE: First off, thanks for the fast reply. Unfortunately, this doesn't solve my problem: the chat texts have a varying number of lines.

And somehow I cannot get a selection with your example.

I tried it here: http://regexr.com/39m2a

I tried this instead: Walter:.\n(.)

This selects Walter: and the first line. Is there a way NOT to select Walter: ?

(I need this to format an InDesign document using text formats.)

Nick Volynkin

3 Answers

1

These are actually two questions:

  1. How to do a match across newlines (asked in the question title)
  2. How to do a match that discards the date/time/person (asked in the question body)

I'll answer question 1:

Before doing the match you want to change the line separator/record separator.

This separator is tool-dependent (it is not part of the regex language itself). For example, in awk you can change the RS variable (you can set it to multiple characters, e.g. colon+newline). In GNU grep you can use -z. See the longer discussion at

How to find patterns across multiple lines using grep?
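To illustrate the tool-dependent point in another tool: in Python the whole file can be held in one string, so a single pattern can span line breaks. The following is only a minimal sketch. It assumes every header line looks like `13. Mär. 01:32 - Name:` (possibly indented, possibly with trailing spaces); the `messages_from` helper, the `STAMP` and `ANY_HEADER` patterns, and the shortened sample text are illustrative assumptions, not something taken from the question.

```python
import re

# Shortened sample in the same layout as the question (assumed format).
SAMPLE = (
    "13. Mär. 01:32 - Walter:  \n"
    "Lorem ipsum dolor sit amet.\n"
    "\n"
    " 13. Mär. 06:15 - Horst:  \n"
    "first line,\n"
    "second line\n"
    "third line.\n"
)

# Assumed header shape: "<day>. <month>. <hh:mm> - <name>:", possibly
# indented and followed by trailing spaces.
STAMP = r"^[ \t]*\d{1,2}\. \w+\. \d{2}:\d{2} - "
ANY_HEADER = STAMP + r"\w+:[ \t]*$"


def messages_from(name, text):
    """Return the message bodies written by `name`, without header lines."""
    pattern = re.compile(
        STAMP + re.escape(name) + r":[ \t]*\n"  # header line: matched, not captured
        r"(.*?)"                                  # body: may span several lines
        r"(?=" + ANY_HEADER + r"|\Z)",           # stop before the next header or at end of text
        re.MULTILINE | re.DOTALL,
    )
    return [m.group(1).strip() for m in pattern.finditer(text)]


print(messages_from("Horst", SAMPLE))
```

`re.DOTALL` lets `.` cross newlines, and `re.MULTILINE` makes `^` and `$` match at every line start and end, which is what lets the lookahead recognise the next header. Only group 1 is returned, so the date / time / person line is never part of the result.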

Community
1

Here's my solution:

awk '$5~/Walter:$/{p=1} $5!~/Walter:$/&&$5~/:$/{p=0} p'

or

awk -vname=Walter 'match($5,name":$"){p=1} !match($5,name":$")&&$5~/:$/{p=0} p'

To filter out empty and date lines, pipe through

awk '$5!~":$"&&NF>0'
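
For anyone who does not have awk at hand, the same flag idea can be sketched in Python. This is a rough equivalent under the same assumption the awk commands make (the fifth whitespace-separated field of a header line is the name followed by a colon); the `lines_from` helper and the shortened sample lines are hypothetical names and data for illustration.

```python
def lines_from(name, lines):
    """Yield the body lines of messages written by `name`."""
    printing = False  # plays the role of the awk flag `p`
    for line in lines:
        fields = line.split()
        # Assumed header test, as in the awk version: the 5th field ends with ":".
        is_header = len(fields) >= 5 and fields[4].endswith(":")
        if is_header:
            printing = fields[4] == name + ":"
            continue  # never emit the date / time / person line
        if printing and fields:  # skip empty lines
            yield line.rstrip("\n")


sample_lines = [
    "13. Mär. 01:32 - Walter:  ",
    "Lorem ipsum dolor sit amet.",
    "",
    " 13. Mär. 06:15 - Horst:  ",
    "first line,",
    "second line",
    "third line.",
]

for text_line in lines_from("Horst", sample_lines):
    print(text_line)
```

Like the awk version, this would misread a message line whose fifth word happens to end in a colon as a header, so it is only as reliable as that assumption.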
Vytenis Bivainis
0

Try it here: http://refiddle.co/1iws

Walter:  \n.*

I have modified the regex so that it works on your data. But once again: your data isn't well structured, so it isn't possible to write a single regex that matches it correctly.

mousetail
Kristijan Iliev