Matching text between a pair of single quotes

Question

I'm trying to extract the ImageNet labels from a .txt file that looks like this:

998: 'ear, spike, capitulum',
999: 'toilet tissue, toilet paper, bathroom tissue'}

I've tried

label = []

txt = open("imagenet1000_clsid_to_human.txt").readlines()
#  print(str(txt))
p = re.compile(r"'(.*?)'")

#  print(txt)
for i in range(len(txt)):
    #  print(txt[i])
    #  print('\n')
    m = p.match(txt[i])

    if m:
        lis = list(m.group())[:-1]
        s = ''.join(lis)
        print(s)
        label.append(s)

to extract the substring inside the single quotation marks, but it continuously spits out 'None'.

I tried the pattern in an online regex tester, and it worked perfectly fine. Can anybody give some advice on this issue?

Well, I haven't shown my whole .txt file, but it consists of multiple lines! – jihan1008 Oct 06 '18 at 07:03
Oh yes. Sorry, I misread your post. I’ll edit my answer - findall is still what you need. – T Burgis Oct 06 '18 at 07:04

score 0 · Answer 1 · answered Oct 06 '18 at 06:55

The main problem is that you should be using re.search(), not re.match(). re.match() only matches at the start of the string, as if there were an implied ^ at the start of the pattern. Each of your lines starts with a number such as 998:, not with a quote, so re.match() returns None.

It is wise to use a raw string for RE patterns, and you have overdone the brackets:

import re

txt = "998: 'ear, spike, capitulum', 999: 'toilet tissue, toilet paper, bathroom tissue'"

p = re.compile(r"'(.*?)'")
m = p.search(txt)
print(m.groups())

Gives:

('ear, spike, capitulum',)
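
A minimal sketch of the difference between the two calls, using one line from the question as sample data (the variable names at_start and anywhere are illustrative, not from the original post):

```python
import re

line = "998: 'ear, spike, capitulum',"
p = re.compile(r"'(.*?)'")

# match() only tries position 0, where the line has a digit, so it fails.
at_start = p.match(line)

# search() scans forward until the pattern fits somewhere in the line.
anywhere = p.search(line)

print(at_start)           # None
print(anywhere.group(1))  # ear, spike, capitulum
```

group(1) returns only the captured text, so the quotes do not have to be stripped by hand afterwards.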

score 0 · Answer 2 · answered Oct 06 '18 at 06:56

This works:

import re
# txt must be one string (e.g. from .read()), not the list that .readlines() returns
re.findall(r"'(.*?)'", txt)

This regex link:

https://regex101.com/r/QP8omt/1
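
As a rough illustration of the findall approach, assuming the file content has been read into a single string (the sample text below stands in for the real file and only contains the two lines shown in the question):

```python
import re

# Sample text standing in for open(...).read(); not the full label file.
txt = (
    "998: 'ear, spike, capitulum',\n"
    "999: 'toilet tissue, toilet paper, bathroom tissue'}"
)

# With one capturing group, findall returns a list of the captured strings.
labels = re.findall(r"'(.*?)'", txt)
print(labels)
# ['ear, spike, capitulum', 'toilet tissue, toilet paper, bathroom tissue']
```

Note that re.findall expects a string; passing the list returned by readlines() would raise a TypeError.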

Rahul Agarwal

4,034
7
27
51

Tomalak · Answer 3 · 2018-10-06T07:05:35.797

Not everything needs to be done through regex.

label = []

with open("imagenet1000_clsid_to_human.txt", 'r', encoding='utf8') as f:
    for line in f:
        parts = line.split("'")
        if len(parts) == 3:
            label.append(parts[1])

Side note: Always open text files with a specific encoding. If you are unsure what encoding the file is, then so is Python. There is no magic encoding detection and you should not rely on Python's defaults.

edited Oct 06 '18 at 07:05

answered Oct 06 '18 at 07:00

Thanks! But where did `len(parts) == 3` come from? – jihan1008 Oct 06 '18 at 07:05
I inserted that to make sure that only lines are considered that contain exactly two single quotes, i.e. exactly 3 parts after splitting. – Tomalak Oct 06 '18 at 07:06
Got it! Thanks :D – jihan1008 Oct 06 '18 at 07:09
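
To make the effect of the `len(parts) == 3` check concrete, here is a small sketch that runs the same split logic on a few in-memory lines. The first two lines come from the question; the blank line and the line with a stray apostrophe are made-up test cases, not taken from the real file:

```python
sample_lines = [
    "998: 'ear, spike, capitulum',\n",
    "999: 'toilet tissue, toilet paper, bathroom tissue'}\n",
    "\n",                             # hypothetical blank line: 1 part
    "1: 'it's a made-up label',\n",   # hypothetical stray apostrophe: 4 parts
]

label = []
for line in sample_lines:
    parts = line.split("'")
    # Exactly two single quotes give [before, label, after], i.e. 3 parts.
    if len(parts) == 3:
        label.append(parts[1])

print(label)
# ['ear, spike, capitulum', 'toilet tissue, toilet paper, bathroom tissue']
```

The check skips any line that does not contain exactly two single quotes, so a label with an apostrophe inside it would be dropped silently. Comparing len(label) with the expected number of classes is a cheap way to notice that.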