Extract the last element of multiple urls in a line

Question

I have unstructured data in the following template:

'<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P31c> <http://www.wikidata.org/entity/Q1454986> .',
'<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P227c> "4079154-3" .',
'<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P373c> "Universe" .'

I want to extract the last element of each url and the result should look like this:

'Q1 P31c Q1454986', 
'Q1 P227c 4079154-3', 
'Q1 P373c Universe'

I have already tried some examples, including this and this, but most of them only handle one link per line.

I am still learning regular expressions and I am not able to solve the above.

Why regexes specifically? There are easier ways to do this. – GPhilo, Oct 22 '19 at 12:26
`' '.join(re.findall(r'/([^/>]+)>', s))` – Wiktor Stribiżew, Oct 22 '19 at 12:27

score 0 · Answer 1 · answered Oct 22 '19 at 12:35

You want two capture groups: one for the text between the last / and the closing >, and one for the text between two double quotes.

/([^/>]+)>|\"([^\"]+)\"
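 
A minimal sketch of how this pattern could be applied in Python, assuming the lines look like the ones in the question. The names `PATTERN` and `extract` are made up for illustration. Because the pattern has two groups, `re.findall` returns a tuple per match, and only one of the two entries is non-empty.

```python
import re

# Group 1: last path segment of a <...> URL; group 2: content of a "..." literal
PATTERN = re.compile(r'/([^/>]+)>|"([^"]+)"')

def extract(line):
    # Each match is a (group1, group2) tuple; pick whichever group matched
    return ' '.join(g1 or g2 for g1, g2 in PATTERN.findall(line))

data = [
    '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P31c> <http://www.wikidata.org/entity/Q1454986> .',
    '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P227c> "4079154-3" .',
    '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P373c> "Universe" .',
]

for line in data:
    print(extract(line))
```

The backslashes before the double quotes in the pattern above are not needed inside a single-quoted raw string, so they are left out here.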

Jacek Rojek

score 0 · Accepted Answer · answered Oct 22 '19 at 12:46

You don't always have to use a regex to extract data.

The code is longer than the regex version, but it can be easier to understand, and therefore easier to write.

data = [
  '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P31c>  <http://www.wikidata.org/entity/Q1454986> .',
  '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P227c> "4079154-3" .',
  '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P373c> "Universe" .'
]

for line in data:  # get lines
    result = []
    for item in line.split()[:3]:        # split line in items and skip last of them
        if item.startswith('<'):         # method for links
            item = item[1:-1]            # skip < >
            item = item.rsplit('/')[-1]  # get last element
        else:                            # method for not links
            item = item[1:-1]            # skip " "
        result.append(item)              # put on list
    print(' '.join(result))              # concatenate in one string
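
One thing to watch for: `line.split()` would break a quoted literal that contains spaces. Below is a rough sketch of a variant that tolerates this, assuming every line still has the form subject, predicate, object, followed by " .". The function name `last_parts` and the "Observable Universe" literal are hypothetical and not part of the question's data.

```python
def last_parts(line):
    # Split off subject and predicate; keep the rest (object plus " .") in one piece
    subject, predicate, rest = line.split(maxsplit=2)
    obj = rest.rstrip().rstrip('.').rstrip()  # drop the trailing " ."

    result = []
    for item in (subject, predicate, obj):
        if item.startswith('<'):
            # link: drop < > and keep the part after the last /
            item = item[1:-1].rsplit('/', 1)[-1]
        else:
            # literal: drop the surrounding double quotes
            item = item.strip('"')
        result.append(item)
    return ' '.join(result)

# Hypothetical line with a space inside the literal
example = '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P373c> "Observable Universe" .'
print(last_parts(example))
```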
