Using regular expressions to extract string from text file

Question

Essentially, I have a txt document with this in it:

The sound of a horse at a gallop came fast and furiously up the hill.
"So-ho!" the guard sang out, as loud as he could roar.
"Yo there! Stand! I shall fire!"
The pace was suddenly checked, and, with much splashing and floundering, a man's voice called from the mist, "Is that the Dover mail?"
"Never you mind what it is!" the guard retorted. "What are you?"
"_Is_ that the Dover mail?"
"Why do you want to know?"
"I want a passenger, if it is."
"What passenger?"
"Mr. Jarvis Lorry."
Our booked passenger showed in a moment that it was his name.
The guard, the coachman, and the two other passengers eyed him distrustfully.

Using regex, I need to print everything within double quotes. I don't want the full code; I just need to know how I should go about doing it and which regex would be most useful. Tips and pointers please!

I would go with this beautiful [tutorial](https://docs.python.org/3.4/howto/regex.html). — Sait, Aug 12 '15 at 00:12
So I should not post an answer that does what you want? That's less of what SO is about :). — Cyphase, Aug 12 '15 at 00:13

score 3 · Accepted Answer · answered Aug 12 '15 at 00:14

`r'(".*?")'` will match every string within double quotes. The parentheses indicate a capture group, the `.` matches any character (except a newline), the `*` indicates repetition (zero or more times), and the `?` makes the repetition non-greedy, so each match ends at the next double quote. Because the quotes sit inside the parentheses, each result includes them. If you want, include the `re.DOTALL` flag to make `.` also match newline characters.
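As a rough illustration of that pattern, here is a minimal sketch that assumes the text has already been read into a string. The variable names `txt` and `with_quotes` are made up for this example and are not part of the answer.

```python
import re

# Two lines from the question's sample text, held in a string (an assumption;
# the asker's real input is a file).
txt = ('"So-ho!" the guard sang out, as loud as he could roar.\n'
       '"Yo there! Stand! I shall fire!"\n')

# The double quotes are inside the capture group, so they are kept in each result.
with_quotes = re.findall(r'(".*?")', txt)

for s in with_quotes:
    print(s)
```

Moving the quotes outside the parentheses would return the same passages without the surrounding quote characters.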

TigerhawkT3

Specifically, the `*` indicates any number (zero or more) of the immediately preceding pattern-- in this case `.`, which is any character. – Aug 12 '15 at 00:23
I run into a problem: on a line with multiple quoted strings, it prints them on the same line as the previous one, whereas I need each on a new line. Is there any way to make it print a newline after it finds the second "? – Nick Adams Aug 12 '15 at 01:04
@NickAdams - Instead of `print(''.join(strings))`, do `print(*strings, sep='\n')`. – TigerhawkT3 Aug 12 '15 at 02:20
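To make the difference between those two `print` calls concrete, here is a small sketch. The list `strings` is hard-coded to stand in for what `re.findall` would return for one line, and the names `joined` and `one_per_line` are illustrative only.

```python
# Stand-in for the list re.findall would return for a line with two quotes.
strings = ['Never you mind what it is!', 'What are you?']

# Joining with an empty separator runs the matches together on one line.
joined = ''.join(strings)
print(joined)

# Unpacking the list and setting sep prints one match per line (Python 3).
print(*strings, sep='\n')

# Equivalent single string, if the text is needed rather than printed.
one_per_line = '\n'.join(strings)
```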

Cyphase · Answer 2 · 2015-08-12T00:22:51.163

0

This should do it (explanation below):

from __future__ import print_function

import re

txt = """The sound of a horse at a gallop came fast and furiously up the hill.
"So-ho!" the guard sang out, as loud as he could roar.
"Yo there! Stand! I shall fire!"
The pace was suddenly checked, and, with much splashing and floundering,
a man's voice called from the mist, "Is that the Dover mail?"
"Never you mind what it is!" the guard retorted. "What are you?"
"_Is_ that the Dover mail?"
"Why do you want to know?"
"I want a passenger, if it is."
"What passenger?"
"Mr. Jarvis Lorry."
Our booked passenger showed in a moment that it was his name.
The guard, the coachman, and the two other passengers eyed him distrustfully.
"""

strings = re.findall(r'"(.*?)"', txt)

for s in strings:
    print(s)

Result:

So-ho!
Yo there! Stand! I shall fire!
Is that the Dover mail?
Never you mind what it is!
What are you?
_Is_ that the Dover mail?
Why do you want to know?
I want a passenger, if it is.
What passenger?
Mr. Jarvis Lorry.

`r'"(.*?)"'` will match every string within double quotes. The parentheses indicate a capture group, so you'll only get the text without the double quotes. The `.` matches any character (except a newline), and the `*` means "zero or more of the last thing", the last thing being the `.`. The `?` after the `*` makes the `*` "non-greedy", which means it matches as little as possible. If you didn't use the `?`, you'd get at most one result per line: everything between the first and last double quote on that line, because `.` does not cross newlines by default.
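The effect of the `?` is easiest to see on a line that contains two quoted passages. This is only a sketch: the names `lazy` and `greedy` are illustrative, and the line is taken from the sample text.

```python
import re

line = '"Never you mind what it is!" the guard retorted. "What are you?"'

# Non-greedy: each match stops at the nearest closing quote.
lazy = re.findall(r'"(.*?)"', line)

# Greedy: a single match runs from the first quote to the last quote on the line.
greedy = re.findall(r'"(.*)"', line)

print(lazy)
print(greedy)
```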

You can include the `re.DOTALL` flag so that `.` will also match newline characters, if you want to extract strings that cross lines. If you want to do that, use `re.findall(r'"(.*?)"', txt, re.DOTALL)`. The newline will be included in the string, so you'd have to check for that.
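A short sketch of the `re.DOTALL` behaviour follows. The `wrapped` string is a made-up sample with a quote broken across two lines (it is not from the question's text), and collapsing the whitespace afterwards is just one possible way to handle the embedded newline.

```python
import re

# Hypothetical sample: the quoted passage wraps onto a second line.
wrapped = 'He said, "Is that\nthe Dover mail?" and waited.'

# Without the flag, . cannot cross the newline, so nothing matches here.
without_flag = re.findall(r'"(.*?)"', wrapped)

# With re.DOTALL, the match spans the line break and keeps the newline.
with_flag = re.findall(r'"(.*?)"', wrapped, re.DOTALL)

# One way to deal with the newline: collapse all whitespace runs to a space.
cleaned = [' '.join(s.split()) for s in with_flag]

print(without_flag)
print(with_flag)
print(cleaned)
```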

Explanation unavoidably similar to / based on @TigerhawkT3's answer. Vote that answer up, too!

edited Aug 12 '15 at 00:22

answered Aug 12 '15 at 00:16

If you see that your answer is identical to an earlier one, it's generally preferable to simply delete your duplicate rather than copy the older answer into your own. This helps cut down on "noise" for visitors. – TigerhawkT3 Aug 12 '15 at 00:32
@TigerhawkT3, understood, but it's not just a duplicate. I already had the code and output there, and wanted to add an explanation, but I didn't want to obfuscate it _just_ so that it would be very different from yours :). Should I have just left my answer without an explanation, and expected viewers to read both answers, just so there wasn't any overlap? Obviously there's often going to be some overlap in answers to the same question. I did explain a tiny bit more than you did; but again, any good explanations are going to be similar. And I did include a mention of your answer anyway :). – Cyphase Aug 12 '15 at 00:37
You are reading from a string, not a text file like me. When I do something like this but from a text file, if there are multiple quotes on a line it will print them on the same line. Is there any way to do a newline after the second "? – Nick Adams Aug 12 '15 at 01:33
@NickAdams, when you read from a text file, the text goes into a string, so it should work the same. Can you show the code you're using? – Cyphase Aug 12 '15 at 01:38
import re with open('atotc.txt') as f: for line in f: strings = re.findall(r'"(.*?)"', line) print (''.join(strings)) – Nick Adams Aug 12 '15 at 01:46
@NickAdams, it's printing multiple quotes on the same line because you're doing `print(''.join(strings))`. Try `for s in strings: print(s)`. Perhaps this will answer or make irrelevant [the new question you just posted](https://stackoverflow.com/questions/31954717/splitting-a-string-before-the-nth-occurrence-of-a-character) :). – Cyphase Aug 12 '15 at 01:49
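Putting the suggestion from this comment thread together, a file-based version might look like the sketch below. To keep it self-contained it writes two lines of the sample text to a temporary file first; in practice the path would be the asker's own file (`atotc.txt` in the comment above). The `found` list is only there to make the result easy to inspect.

```python
import os
import re
import tempfile

sample = ('"Never you mind what it is!" the guard retorted. "What are you?"\n'
          '"_Is_ that the Dover mail?"\n')

# Write the sample to a temporary file so the sketch runs on its own.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

found = []
with open(path) as f:
    for line in f:
        # Print each match on its own line instead of joining them.
        for s in re.findall(r'"(.*?)"', line):
            print(s)
            found.append(s)

os.remove(path)  # clean up the temporary file
```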
