delete everything except URL with python

Question

I have a JSON file that contains metadata for 900 articles. I want to delete all the data except for the lines that contain URLs and save the result as a .txt file. I wrote this code, but I couldn't work out the saving step:

import re

with open("path\url_example.json") as file:
    for line in file:
         urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
         print(urls)

A part of the results:

['http://www.google.com.']
['https://www.tutorialspoint.com']

Another issue is that each result is wrapped in [' '] and may end with a trailing dot (.), which I don't need. My expected result is:

 http://www.google.com
 https://www.tutorialspoint.com

I'd have thought `"path\url_example.txt"` would raise a `SyntaxError` as well... — Jon Clements, Dec 15 '18 at 16:36
Could you show an example of your input file? Is it a JSON object per line for instance? If so, does it have attributes called "url" or "link" or "href" or whatever, so that you can parse the line as json using `json.loads` and then just retrieve the appropriate parts instead of regexing stuff out? — Jon Clements, Dec 15 '18 at 16:38

Densetsu_No · Answer 1 · 2018-12-15T16:50:53.763

Without further information on the file you have (txt, json?) and on the kind of input line you are looping through, here is a simple attempt that does not use re.findall().

import re

with open(r"path\url_example.txt") as handle:
    for line in handle:
        if not re.search('http', line):  # skip lines without a URL
            continue
        spos = line.find('http')
        epos = line.find(' ', spos)
        url = line[spos:epos]
        print(url)

edited Dec 15 '18 at 16:50

answered Dec 15 '18 at 16:41

*I guess your file is a txt and not a json otherwise your code wouldn't work.* - well, it would if it was one json object per line, or formatted such that it's pretty printed and the urls happen to be accessible on a single line... :) – Jon Clements Dec 15 '18 at 16:43
Also... that `re.search` could `if 'http' not in line`... also... try running your code with `line = 'http://example.com'`... you'll get the wrong output... – Jon Clements Dec 15 '18 at 16:45
Modified the intro text; it should be more precise now. – Densetsu_No Dec 15 '18 at 16:54
Given an input of `http://example.com` where there isn't a space, you end up with `epos == -1` which means you slice off the last character giving you an output of: `'http://example.co'`... – Jon Clements Dec 15 '18 at 16:56
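
One way to avoid the slicing problem raised in the comments is to treat a missing space as "the URL runs to the end of the line". This is only a minimal sketch, assuming at most one URL per line; the helper name `first_url` is made up for illustration and is not part of the answer.

```python
def first_url(line):
    """Return the whitespace-delimited token that starts at 'http', or None."""
    start = line.find('http')
    if start == -1:
        # no URL on this line
        return None
    # split() without arguments splits on any whitespace, including the
    # trailing newline, so there is no end position that can come back as -1
    return line[start:].split()[0]


print(first_url('http://example.com'))
print(first_url('see http://example.com for details\n'))
```

Like the original answer, this splits on whitespace only, so quotes or commas that touch the URL in a JSON line would still be kept.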

Benjamin Rowell · Answer 2 · 2018-12-15T16:48:55.070

If you know which key your URLs will be found under in your JSON, you might find an easier approach is to deserialize the JSON using the `json` module from the Python standard library and work with a dict instead of using regex.
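
The structure of the file is not shown in the question, so the following is only a sketch under an assumption: the file holds a JSON array of objects and each URL sits under a key named "url". Both the key name and the helper `urls_from_json` are hypothetical.

```python
import json


def urls_from_json(text, key="url"):
    """Collect the values stored under `key` in a JSON array of objects."""
    records = json.loads(text)
    # skip records that do not have the key at all
    return [record[key] for record in records if key in record]


# made-up sample data standing in for the real file contents
sample = '[{"title": "a", "url": "http://www.google.com"}, {"title": "b"}]'
print(urls_from_json(sample))  # ['http://www.google.com']
```

For a real file, the text passed in would be the result of reading the file, for example `open(path).read()`.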

However, if you want to work with regex, remember that urls is a list of regex matches. If you know there is going to be exactly one match per line, then just print the first entry and use rstrip to remove the trailing ".", if it's there.

import re

with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        if urls:  # lines without a URL produce an empty list
            print(urls[0].rstrip('.'))

If you expect to see multiple matches per line:

import re

with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        # print every match on the line, without a trailing dot
        for url in urls:
            print(url.rstrip('.'))
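
The question also asks how to save the result as a .txt file. A minimal sketch of that step, reusing the regex from the question: the function names `extract_urls` and `save_urls` are made up for illustration, and writing one URL per line is an assumption about the desired output format.

```python
import re

# same pattern as in the question, written as a raw string
URL_PATTERN = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')


def extract_urls(lines):
    """Yield every match found in `lines`, without a trailing dot."""
    for line in lines:
        for url in URL_PATTERN.findall(line):
            yield url.rstrip('.')


def save_urls(source_path, target_path):
    """Write the URLs found in source_path to target_path, one per line."""
    with open(source_path) as source, open(target_path, 'w') as target:
        for url in extract_urls(source):
            target.write(url + '\n')
```

It could be called as `save_urls(r"path\url_example.json", r"path\url_example.txt")`, where the raw-string prefix keeps `\u` from being read as an escape sequence.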

You can just `print(url.rstrip('.'))` - seems a bit of waste using the if/else here to check it ends with `.` to remove it... just print it stripped, and if it didn't have a dot, it still won't, and if it did, it won't now... so no need to check it first. — Jon Clements, Dec 15 '18 at 16:47
@JonClements thanks for picking that up, having a dim moment. — Benjamin Rowell, Dec 15 '18 at 16:49
