-2

I'm trying to extract the ImageNet labels from a .txt file that is formatted as follows.

998: 'ear, spike, capitulum',
999: 'toilet tissue, toilet paper, bathroom tissue'}

I've tried

import re

label = []

txt = open("imagenet1000_clsid_to_human.txt").readlines()
#  print(str(txt))
p = re.compile(r"'(.*?)'")

#  print(txt)
for i in range(len(txt)):
    #  print(txt[i])
    #  print('\n')
    m = p.match(txt[i])

    if m:
        lis = list(m.group())[:-1]
        s = ''.join(lis)
        print(s)
        label.append(s)

to extract the substring inside the single quotation marks, but the match is always None.

I tried the pattern in an online regex tester, and it worked perfectly fine. Can anybody give some advice on this issue?

Tomalak
jihan1008

3 Answers

0

The main problem is that you should be using re.search(), not re.match(). re.match() only matches at the very start of the string, as if the pattern began with an implied ^. Each of your lines starts with a number rather than a quote, so the match fails and you get None.

It is wise to use a raw string for RE patterns, and you have overdone the brackets:

import re

txt = "998: 'ear, spike, capitulum', 999: 'toilet tissue, toilet paper, bathroom tissue'"

p = re.compile(r"'(.*?)'")
m = p.search(txt)
print(m.groups())

Gives:

('ear, spike, capitulum',)
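
To see the difference on a single line of the file, here is a minimal sketch. The variable name `line` is only illustrative; its content is copied from the sample in the question.

```python
import re

# One line as readlines() would return it, including the trailing newline.
line = "998: 'ear, spike, capitulum',\n"
p = re.compile(r"'(.*?)'")

# match() anchors at position 0, where the line has "9", not a quote.
print(p.match(line))             # None

# search() scans forward until the pattern fits somewhere in the line.
print(p.search(line).group(1))   # ear, spike, capitulum
```

Using group(1) returns only the text captured between the quotes, so there is no need to strip characters afterwards.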
cdarke
0

This works:

import re
# txt must be a single string (e.g. from .read()), not the list returned by .readlines()
re.findall(r"'(.*?)'", txt)

Here is the regex on regex101:

https://regex101.com/r/QP8omt/1
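
As a small self-contained sketch, the same call on the two sample lines from the question looks like this. The name `sample` is made up for illustration and stands in for the file contents read as one string.

```python
import re

# Stand-in for the file contents read with .read(); text taken from the question.
sample = (
    "998: 'ear, spike, capitulum',\n"
    "999: 'toilet tissue, toilet paper, bathroom tissue'}"
)

# findall() returns the captured group for every non-overlapping match.
labels = re.findall(r"'(.*?)'", sample)
print(labels)
# ['ear, spike, capitulum', 'toilet tissue, toilet paper, bathroom tissue']
```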

Rahul Agarwal
0

Not everything needs to be done through regex.

label = []

with open("imagenet1000_clsid_to_human.txt", 'r', encoding='utf8') as f:
    for line in f:
        parts = line.split("'")
        if len(parts) == 3:
            label.append(parts[1])
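
To illustrate what the len(parts) == 3 check is doing, here is a rough sketch on two lines. The first line comes from the question. The second is a hypothetical line, assumed here only to show what happens if a label itself contains an apostrophe and is wrapped in double quotes.

```python
# A line from the question: exactly two single quotes, so three parts.
good = "998: 'ear, spike, capitulum',\n"
good_parts = good.split("'")
print(good_parts)        # ['998: ', 'ear, spike, capitulum', ',\n']

# Hypothetical line (not from the question): one apostrophe, so two parts.
odd = '607: "jack-o\'-lantern",\n'
odd_parts = odd.split("'")
print(len(odd_parts))    # 2, so the loop above would skip this line
```

If the file does contain lines like the second one, they would be silently dropped, so it may be worth comparing the length of the result against the expected number of labels.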

Side note: Always open text files with a specific encoding. If you are unsure which encoding the file uses, then so is Python. There is no magic encoding detection and you should not rely on Python's defaults.
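
A quick sketch of why this matters: the same bytes decode to different text depending on the encoding. The word "café" is just an illustrative string, not something taken from the labels file.

```python
# Encode a string with a non-ASCII character as UTF-8 bytes.
data = "café".encode("utf8")

# Decoding with the right encoding round-trips cleanly.
right = data.decode("utf8")
print(right)   # café

# Decoding with a different encoding raises no error but yields mojibake.
wrong = data.decode("latin-1")
print(wrong)   # cafÃ©
```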

Tomalak