
I have unstructured data in the following template:

'<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P31c> <http://www.wikidata.org/entity/Q1454986> .',
'<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P227c> "4079154-3" .',
'<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P373c> "Universe" .'

I want to extract the last element of each URL (and the bare value of each quoted literal), so the result should look like this:

'Q1 P31c Q1454986', 
'Q1 P227c 4079154-3', 
'Q1 P373c Universe'

I already tried some examples including this and this, but most of them only handle one link per line.

I am still learning regular expressions and I am not able to solve the above.

Tomerikoo
ellisa21

2 Answers


You want two capture groups: one for the text between the last / and the closing >, and one for the text between two double quotes ("):

/([^/>]+)>|\"([^\"]+)\"
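In Python, this pattern could be applied with `re.findall`, which returns one tuple per match with only one of the two groups filled in. The following is a minimal sketch, assuming the input is a list of strings like the one in the question; the names `TOKEN` and `extract` are made up for illustration, and the unnecessary backslashes before the quotes are dropped.

```python
import re

# Either the last path segment of a <...> URL, or the contents of a "..." literal.
TOKEN = re.compile(r'/([^/>]+)>|"([^"]+)"')

def extract(line):
    # findall yields (url_part, literal_part) tuples; exactly one is non-empty.
    return ' '.join(url or literal for url, literal in TOKEN.findall(line))

data = [
    '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P31c> <http://www.wikidata.org/entity/Q1454986> .',
    '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P227c> "4079154-3" .',
    '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P373c> "Universe" .',
]

result = [extract(line) for line in data]
print(result)
```

Because the literal is matched as a whole between its quotes, a value containing spaces would stay in one piece. A literal that itself contains something like `/x>` would confuse the pattern, so treat this as a sketch for data shaped like the sample.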
Jacek Rojek

You don't always have to use regex to extract data.

The code is longer than with regex, but it can be easier to understand, and therefore easier to write.

data = [
  '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P31c>  <http://www.wikidata.org/entity/Q1454986> .',
  '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P227c> "4079154-3" .',
  '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P373c> "Universe" .'
]

for line in data:  # get lines
    result = []
    for item in line.split()[:3]:        # split line in items and skip last of them
        if item.startswith('<'):         # method for links
            item = item[1:-1]            # skip < >
            item = item.rsplit('/', 1)[-1]  # split once from the right, keep last element
        else:                            # method for not links
            item = item[1:-1]            # skip " "
        result.append(item)              # put on list
    print(' '.join(result))              # concatenate in one string
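One limitation of splitting on every whitespace is that a quoted value containing spaces would be cut into several items. A possible variation, sketched below, splits only twice so the third part keeps everything up to the trailing dot. The helper names `clean` and `extract` and the "Milky Way" line are hypothetical, and the sketch assumes plain quoted literals with no language tag or datatype suffix.

```python
def clean(item):
    if item.startswith('<'):                  # link: strip < > and keep last path segment
        return item[1:-1].rsplit('/', 1)[-1]
    return item[1:-1]                         # literal: strip the surrounding quotes

def extract(line):
    # Split into subject, predicate and "everything else", so a literal
    # with spaces is not broken apart.
    subject, predicate, obj = line.split(None, 2)
    obj = obj.rstrip(' .')                    # drop the trailing " ." terminator
    return ' '.join(clean(item) for item in (subject, predicate, obj))

# Hypothetical line with a space inside the literal.
line = '<http://www.wikidata.org/entity/Q1> <http://www.wikidata.org/entity/P373c> "Milky Way" .'
print(extract(line))
```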
furas