I'm trying to perform some regex searches on a QTextEdit widget which has rich text format. Visually it displays properly but regex ignores the numerous <BR> line breaks and sees the entire body of text as one large single line.

If I display the text as plain text and use \n for new lines instead, the regex search works perfectly, interpreting each line as its own line. However, plain text, as the name suggests, has no rich text formatting, which I need.

Is there any way for regex to interpret an HTML line break as a new line instead of \n, or any way I can get regex to work properly with HTML? I tried adding <BR>\n but that doesn't do anything.

I'm using QTextEdit.find( QRegExp ) with PyQt5 for Python.


Here's an example of what's happening:

Regex pattern: Lorem.+

Text body:

Lorem ipsum dolor sit amet.

Consectetur adipiscing elit.

sed do eiusmod tempor incididun

Expected match:

Full match  0-28    `Lorem ipsum dolor sit amet. `

Actual match:

Full match  0-89    `Lorem ipsum dolor sit amet.Consectetur adipiscing elit.sed do eiusmod tempor incididunt`

I don't think the issue is the dot matching everything including the new line character, because when I use the exact same regex pattern in plain text mode with \n as new lines, the dot doesn't match the new line characters. This only happens when I set the QTextEdit text with HTML instead of plain text.

DonutPilot

3 Answers

Instead of using <br> to indicate a new line in rich text HTML, put each line between <div> tags. When using regex, it will then see the end of a div as the end of a line.

For some odd reason, in Qt the <br> line break tag in a QTextEdit's rich text HTML isn't interpreted as \n when using regex.

Really odd fix but it works: the text content doesn't change when converting back to plain text, and it's visually exactly the same as using <br>.

Not sure if this is an error with Qt or if there's a reason for this.
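
As a rough sketch of that conversion, the following wraps each <br>-separated line in its own <div> using plain string handling. The helper name `br_to_divs` is hypothetical (it is not part of Qt or PyQt5), and the sketch assumes the input contains only simple inline markup with <br> as the line separator.

```python
import re


def br_to_divs(rich_text: str) -> str:
    """Hypothetical helper: wrap each <br>-separated line in its own <div>."""
    # Split on <br>, <BR>, <br/> and <br /> variants (case-insensitive).
    lines = re.split(r"<br\s*/?>", rich_text, flags=re.IGNORECASE)
    # Each line becomes its own block-level element.
    return "".join(f"<div>{line}</div>" for line in lines)


body = (
    "Lorem ipsum dolor sit amet.<br>"
    "Consectetur adipiscing elit.<BR>"
    "sed do eiusmod tempor incididunt"
)
wrapped = br_to_divs(body)
print(wrapped)
```

The resulting string could then be passed to QTextEdit.setHtml() before calling QTextEdit.find() with the QRegExp, as described in the question.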

Dorian Turba
DonutPilot

As far as I know, parsing HTML with regex is the wrong way to go:

RegEx match open tags except XHTML self-contained tags

Have you tried using an XML parser instead?

Try this: Lorem.+?(<\\Br>|\n)

Dorian Turba
  • Problem is that the .find() method in the QTextEdit takes either a string or a QRegExp. If I were to use something other than those, I'd lose a lot of the functionality which I need. My very last option would be to re-implement the .find() method, however I'm hoping there's a much easier solution to just detect a <br> and have it know the line ends there. – DonutPilot Jun 19 '18 at 15:11
  • Can't you keep using regex, and use another tool for this specific thing? The XML parser may give you the correct string, no? – Dorian Turba Jun 19 '18 at 15:15
  • I can do QTextEdit.toPlainText() and that parses the HTML including the line breaks and use regex on that but I wouldn't achieve anything with it. .find() selects the results. – DonutPilot Jun 19 '18 at 15:33
  • Could you edit your question with some regex you have tried? – Dorian Turba Jun 19 '18 at 15:57
  • The 'Lorem.+' regex example in the question is exactly what happens. It selects everything when using <br> but only the first line if I'm using plain text and \n as a line break. – DonutPilot Jun 19 '18 at 17:26
  • Ok. Could you try this? `Lorem.+?(<\\Br>|\n)` – Dorian Turba Jun 19 '18 at 20:06
  • It didn't work, but I found a solution. I think it might be a bug on Qt's end in how the rich text format gets interpreted when using regex. When I add block tags such as `<div>` and use regex, it picks up that there's a new line, but not when I use the `<br>` tag. I ended up putting each line between div tags, and that solved the new line character not appearing when using regex. – DonutPilot Jun 19 '18 at 20:54
  • So you found the answer? You can write an answer to your own question explaining how you found it, how it works, etc., like I did in https://stackoverflow.com/questions/50822068/why-numpy-asarray-return-an-array-full-of-boolean, or you can edit my answer to explain it. I hope I helped you. – Dorian Turba Jun 20 '18 at 10:02
  • Thanks for your help, I appreciate it! :) I posted the answer, I wasn't able to mark it as answered at the time but the 2 day window passed so now I marked it. – DonutPilot Jun 21 '18 at 21:23

First off, when trying to figure out regular expressions, regex101.com is your friend.

Second, you probably want to use QRegularExpression. QRegExp is really heading toward deprecation and is not nearly as powerful (or compliant) as QRegularExpression.

That said, let's look at the possible ways the "html" you are trying to capture could be written as a plain string.

Lorem ipsum dolor sit amet.<br>Consectetur adipiscing elit.<br>sed do eiusmod tempor incididun

The first thing to try would be Lorem.+<br> (note: no \n), but that will match too much.

Regular expressions are greedy by default. That means they will try to match as much as possible, giving you the first two lines. So, we need to tell + to not be greedy. If you are using QRegularExpression, you can use the non-greedy quantifier to come up with Lorem.+?<br>.
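
To illustrate the difference, here is a small sketch using Python's built-in `re` module as a stand-in for QRegularExpression (an assumption made for convenience; both accept the same greedy and non-greedy syntax for this pattern). It runs both patterns against the sample string from above.

```python
import re

# The sample "html" from above, written as a plain string.
text = (
    "Lorem ipsum dolor sit amet.<br>"
    "Consectetur adipiscing elit.<br>"
    "sed do eiusmod tempor incididun"
)

# Greedy: .+ grabs as much as possible, so the match runs to the LAST <br>.
greedy = re.search(r"Lorem.+<br>", text).group()

# Non-greedy: .+? stops at the FIRST <br> that lets the pattern succeed.
lazy = re.search(r"Lorem.+?<br>", text).group()

print(greedy)  # Lorem ipsum dolor sit amet.<br>Consectetur adipiscing elit.<br>
print(lazy)    # Lorem ipsum dolor sit amet.<br>
```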

jwernerny
  • I am aware of QRegularExpression vs QRegExp, but QTextEdit.find() only takes strings and QRegExp parameters as inputs. It hasn't been implemented yet for QRegularExpression, so I'm pretty much limited to the QRegExp class. The problem with QTextEdit is that it will visually show you the line breaks, but I think when regex looks at it there are no line breaks. When setting the HTML of the TextEdit it also strips any `\n` in your code. So doing a search for `<br>` or `\n` returns no matches; the text is literally stored as one long line. – DonutPilot Jun 19 '18 at 18:18
  • regex101.com is a great tool, but because of how QTextEdit displays rich format text vs how it stores it, it's not much use for a scenario like this. – DonutPilot Jun 19 '18 at 18:18
  • @DonutPilot Why don't you retrieve the data as HTML into a QString from QTextEdit.toHtml(), then use QRegularExpression.match(..)? – jwernerny Jun 20 '18 at 20:18
  • QTextEdit.find() finds and also selects the text with the text cursor in the QTextEdit field which then I'm using to apply special text formatting to. – DonutPilot Jun 21 '18 at 21:21