2

This looks like an easy task but I'm not being able to achieve it with LibreOffice Calc REGEX formulas. I have a bunch of exams, each one with 20 multiple choice questions in this format:

1. Which of the following statements about producers is false?
a. Households produce many goods and services for themselves.
b. People set up some producers who do not aim to make profits.
c. All the goods and services consumed in any country are produced by its own producers.
d. Governments arrange the production of some goods and services.

Some questions have several paragraphs with new line characters in between. What I want to achieve is to capture/extract each one of these strings (question / answer a / answer b / answer c / answer d) using REGEX in LibreOffice. The idea is having one REGEX for each string to separate the info in different cells like this: libreoffice sheet example

For now I managed to do the matching with an online regex tester: https://regex101.com/r/jhfr63/1

To capture the question string: (^\d+\.[\S\s]+?(?=^a\.))
And to capture for instance answer "a": (^a.[\S\s]*?(?=^b\.))

I haven't had any luck doing the same with LibreOffice REGEX, so I'd need some help if possible. After changing the above regex the best I could achieve was to extract the question, but it only works for the first match, and I'm not too sure how it works:

=REGEX(A1,"^\d+\.[\S\s]*(?=\b[a]\.)",,1)
ennanz
  • 21
  • 3

2 Answers2

2

The problem seems to be the ^ in both of your regexes. This will match only at the beginning of the cell's content, but not after line breaks. For line breaks, you'll have to search for \n. This is the crucial point, since in your use case, the line breaks are delimiting the components of your result.

Try the following (surely not the best solution, but it did work for me).

Assuming that A1 contains the questions/answers, you'll have to put the following formulas into B1:F1 (first formula into B1, second line into C1 and so on):

=REGEX($A$1;"(?<=^|\n)\d\.(.|\n)*?(?=\na\.)";;ROW())
=REGEX($A$1,"(?<=\n)a\..*(?=\n+b\.)",,ROW())
=REGEX($A$1,"(?<=\n)b\..*(?=\n+c\.)",,ROW())
=REGEX($A$1,"(?<=\n)c\..*(?=\n+d\.)",,ROW())
=REGEX($A$1,"(?<=\n)d\..*(?=(\n+\d\.)|$)",,ROW())

Then, just select B1:F1 and pull the formulas down until B20:F20. Since i've made use of the ROW() function, every line will select the nth Question / Answer.

Explanation:

  • As first element, i've used a positive lookbehind (?<=\n), so it matches after a hard line break (only for the Question pattern, i've used (?<=\n|^) since the first question starts immediately at the beginning of the cell.
  • next, i've searched for
    • one or more digits followed by a dot (Column B - questions: \d+\.) or
    • a single letter followed by a dot (Column C:F - answers: a\. to d\.).
  • next, select everything (.*) delimited by a positive lookahead ((?=\n[a]\.)). The lookahead pattern depends on the column:
    • for the question column, look for one or more linebreaks \n+ followed the beginning of the first answer a\.;
    • to find the end of the first answer, look for one or more linebreaks \n+ followed the beginning of the second answer b\.
    • and so on for 3rd and 4th answer.
  • Only for the last answers pattern, i've used a different positive lookahead (?=(\n+\d\.)|$) since after answer 20.d, there's no \n21. to match, so i had to include the end of the cell $ as alternative option.

It would be much better to have a single regex for all of the questions. This might be possible, but it would be quite difficult to achieve (depending on the column number, select one of a-d). Thus, i've used one regex for each of the answers.

tohuwawohu
  • 13,268
  • 4
  • 42
  • 61
  • 1
    Thanks a lot for your detailed answer! This is what I need. It's almost there... the answers (a to d) work perfectly fine, but for some reason in the Question regex I'm getting a #N/A in those cells. – ennanz Aug 04 '22 at 21:32
  • 1
    @ennanz: you're right, I've stumbled over the newlines in the questions myself. `.` doesn't match newlines, so my initial version worked only if the questions don't contain newlines. Otherwise, the `.*` would stop at the first newline, with the lookahead failing, thus no matches. To capture the question part, instead of `\d+\..*`, `\d\.(.|\n)*`is required. – tohuwawohu Aug 04 '22 at 22:12
  • 1
    Yes! That did it, only I had to add a "+" sign after the first "d" or it would only work up to exercise 9. So the final regex to capture the questions is as follows: `=REGEX($A$1,"(?<=^|\n)\d+\.(.|\n)*?(?=\na\.)",,ROW())`. Thank you! – ennanz Aug 05 '22 at 01:29
  • 1
    I now noticed an minor error in the REGEX for "answer d", I had to add a "+" sign after the "d" in the lookahead or it would pick up questions in some of the answer matches. Something else I discovered now (that I didn't specify in the original question) is there were some newlines in some of the answers too. – ennanz Aug 05 '22 at 02:26
  • 1
    To recap: the five regex searches that solved the problem are: – ennanz Aug 05 '22 at 02:34
  • 1
    `=REGEX$A$1;"(?<=^|\n)\d+\.(.|\n)*?(?=\na\.)";;ROW())` – ennanz Aug 05 '22 at 02:35
  • 1
    `=REGEX($A$1,"(?<=\n)a\.(.|\n)*?(?=\n+b\.)",,ROW())` – ennanz Aug 05 '22 at 02:35
  • 1
    `=REGEX($A$1,"(?<=\n)b\.(.|\n)*?(?=\n+c\.)",,ROW())` – ennanz Aug 05 '22 at 02:36
  • 1
    `=REGEX($A$1,"(?<=\n)c\.(.|\n)*?(?=\n+d\.)",,ROW())` – ennanz Aug 05 '22 at 02:36
  • 1
    `=REGEX($A$1,"(?<=\n)d\.(.|\n)*?(?=(\n+\d+\.)|$)",,ROW())` – ennanz Aug 05 '22 at 02:36
  • I'm not sure how I mark this as resolved, can´t find any "approve" button. Is it OK if I update the regex in your post to the final version? – ennanz Aug 05 '22 at 19:05
  • 1
    @ennanz: Did the update myself :-) – tohuwawohu Aug 06 '22 at 07:33
2

Without regex:

enter image description here

Formula in B1:

=TRIM(FILTERXML("<t><s>"&REGEX(SUBSTITUTE(SUBSTITUTE($A$1,"&","&amp;"),"<","&lt;"),"(?<=\n|^)(?=\d+\.|[a-d]\.)","</s><s>","g")&"</s></t>","//s[node()]["&COLUMN(A1)+ROW(A1)*5-5&"]"))

Drag 5 columns right and x-amount rows down untill no more values.


Regex pattern: (?<=\n|^)(?=\d+\.|[a-d]\.) means:

  • (?<=\n|^) - Positive lookbehind to assert position is preceded by newline character or start-string;
  • (?=\d+\.|[a-d]\.) - Positive lookbehind to assert position is followed by 1+ digits or character a-d and literal dot.

Xpath expression //s[node()]["&COLUMN(A1)+ROW(A1)*5-5&"] means:

  • //s[node()] - Any non-empty nodes;
  • ["&COLUMN(A1)+ROW(A1)*5-5&"] - Nested math to return the appropriate index from elements.
JvdV
  • 70,606
  • 8
  • 39
  • 70
  • 2
    IMHO a great and very elegant approach. The problem is that the "questions" blocks in A1 may _"have several paragraphs with new line characters in between"._ Thus, simply splitting the input by the newline characters (`CHAR(10)`) will fail in those case, since it puts the parts of those questions into different cells. – tohuwawohu Aug 04 '22 at 21:26
  • 1
    Thanks for your solution too! As tohuwawohu said I'm having problems when there are new lines in between, the example I wrote in the original was an optimal case with only one line of text for the question. With this solution the desired cell contents are "offset" from their position. – ennanz Aug 04 '22 at 21:37
  • 2
    Thanks for the feedback @tohuwawohu. I didn't quite get that part from reading the question and currently unavailable to get behind my pc to rectify this. It requires some tweaks to make it work with a nested Regex probably to only transform the appropriate newline character to the correct xpath tag ends. – JvdV Aug 04 '22 at 22:00
  • 1
    @ennanz, thanks for the comment and clarification. I'm away from pc for another 6 hours or so but I think this is salvageable. Do you have examples of these newlines? – JvdV Aug 04 '22 at 22:02
  • 2
    @tohuwawohu, I can't right now but instead of substitute, I'd then use Regex to replace `^(?=[\da-d]\.)` with `` and keep the xpath etc intact. Maybe a `TRIM()` at the end. – JvdV Aug 04 '22 at 22:10
  • JvdV, great, not sure how to send you a text here, but I have an example exam in the original post, in the regex online test I made: [link](https://regex101.com/r/jhfr63/1). Ehem... now I noticed also some answers have new lines too... – ennanz Aug 05 '22 at 01:41
  • @ennanz, have a look at the edited answer. Maybe that works. Just notice I also included some `SUBSTITUTE()` to prevent possible errors when the ampersand and left angle bracket would be included in the text. See [this](https://stackoverflow.com/a/61837697/9758194) answer I made a while ago to see why this works. – JvdV Aug 05 '22 at 05:59
  • @JvdH, that worked perfectly! I'll have to study the xpath part as I haven't used that before. Thanks! – ennanz Aug 05 '22 at 13:59