Python regex? I'm in trouble

Question

Hello, world! I'm having trouble with regex. I'm using an HTTP API (for searching Italian trains) that gives me information like this, for example:

10911 - SESTO S. GIOVANNI|10911-S01325

Format:

TRAIN_NUMBER - STATION|TRAIN_NUMBER-STATION_CODE

While there were only a few requests everything was fine, because I needed only one piece of information, "S01325". But as the number of users grew, I discovered that two trains can share the same number. For example, the number 612 can refer to two different trains; in that case the API gives me:

612 - TARANTO|612-S11465
612 - ASSO|612-N00079

When (using urllib.request module) I try to read this, I get:

b'612 - TARANTO|612-S11465\n612 - ASSO|612-N00079\n'

I need two list variables:

A = ['612 - TARANTO', '612 - ASSO'] #First regex expression
B = ['S11465', 'N00079'] #Second regex expression

I must use REGEX, true? I have never used regex, so I don't know what to do. I searched Google and the wikis and docs, but I couldn't find a solution to this problem. Obviously the regular expression must work for all cases, for example:

b'2097 - MILANO CENTRALE|2097-S01700\n'

This should give me:

A = ['2097 - MILANO CENTRALE']
B = ['S01700']

Another example:

b'123 - ROMA TERMINI|123-S01358\n123 - TREVIGLIO|123-S01703\n'

This should give me:

A = ['123 - ROMA TERMINI', '123 - TREVIGLIO']
B = ['S01358','S01703']

Thank you very much for reading. I hope I was clear. Have a good day, Marco. P.S.: Link to the Italian docs

Have you tried to write a regular expression? If you have, please add it to your question; it makes it easier to help. – Barnabus, Jul 10 '16 at 12:03
What is the API? I imagine there is a good chance you can get the data in a much more usable format. – Padraic Cunningham, Jul 10 '16 at 12:08
@PadraicCunningham this is the only API that works. Yes, I hate it. – MarcoBuster, Jul 10 '16 at 12:17
@PadraicCunningham the API is unofficial, so there is no official documentation. There is unofficial documentation on GitHub, but it is in Italian: https://github.com/sabas/trenitalia – MarcoBuster, Jul 10 '16 at 12:25

score 4 · Accepted Answer · answered Jul 10 '16 at 12:06

You don't need regular expressions, actually. You can use them though. There's a rather simple pattern in your information:

<Train number> - <city>|<Train number>-<identifier>

So let's look at what happens if you do

>>> '123 - ROMA TERMINI|123-S01358'.split('|', 1)
['123 - ROMA TERMINI', '123-S01358']

So now you have the first part of what you want. The second part can be extracted in a similar way. Let's look at

>>> '123-S01358'.split('-', 1)
['123', 'S01358']

So you can do

>>> '123-S01358'.split('-', 1)[-1]
'S01358'

And you're done!

If you combine all of this together you should get your answer.
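As a minimal sketch of how those steps might be combined for a whole response, the block below decodes the bytes, walks the lines, and fills the two lists. The names `parse_trains`, `parse_trains_re`, and `LINE_RE` are made up for illustration. The regex variant is included only for comparison, and it assumes the train number is all digits and the station code contains no whitespace.

```python
import re


def parse_trains(raw: bytes):
    """Split-based parsing: return (A, B) for a raw API response."""
    A, B = [], []
    for line in raw.decode("utf-8").splitlines():
        if not line.strip():
            continue  # ignore empty lines
        # '612 - TARANTO|612-S11465' -> '612 - TARANTO' and '612-S11465'
        label, _, tail = line.partition("|")
        A.append(label)
        # '612-S11465' -> 'S11465'
        B.append(tail.split("-", 1)[-1])
    return A, B


# Regex variant (assumption: digits-only train number, code without spaces).
LINE_RE = re.compile(r"^(.+?)\|\d+-(\S+)$", re.MULTILINE)


def parse_trains_re(raw: bytes):
    """Regex-based parsing: same result as parse_trains on well-formed input."""
    matches = LINE_RE.findall(raw.decode("utf-8"))
    return [m[0] for m in matches], [m[1] for m in matches]


A, B = parse_trains(b"612 - TARANTO|612-S11465\n612 - ASSO|612-N00079\n")
print(A)  # ['612 - TARANTO', '612 - ASSO']
print(B)  # ['S11465', 'N00079']
```

The split-based version only relies on the `|` and the first `-` after it, so it is probably the more forgiving of the two if the format varies slightly.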

score 2 · Answer 2 · answered Jul 10 '16 at 12:03

I must use REGEX, true?

Not true.

I think a better solution is to parse each line into tokens and assign them to sensible variables. You need a solution that is less about string primitives and regex; more about objects and encapsulation.
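One way this idea could look in practice is sketched below. The class `TrainEntry`, its field names, and the helper `parse_response` are hypothetical names chosen for illustration, and the sketch assumes every line follows the `NUMBER - STATION|NUMBER-CODE` layout shown in the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainEntry:
    """One line of the response, split into named fields."""
    number: str
    station: str
    station_code: str

    @classmethod
    def from_line(cls, line: str) -> "TrainEntry":
        # '123 - ROMA TERMINI|123-S01358' -> label and tail
        label, _, tail = line.strip().partition("|")
        # '123 - ROMA TERMINI' -> '123' and 'ROMA TERMINI'
        number, _, station = label.partition(" - ")
        # '123-S01358' -> 'S01358'
        station_code = tail.split("-", 1)[-1]
        return cls(number, station, station_code)


def parse_response(raw: bytes) -> list:
    """Decode the raw bytes and build one TrainEntry per non-empty line."""
    lines = raw.decode("utf-8").splitlines()
    return [TrainEntry.from_line(line) for line in lines if line.strip()]


entries = parse_response(b"123 - ROMA TERMINI|123-S01358\n123 - TREVIGLIO|123-S01703\n")
print([e.station_code for e in entries])  # ['S01358', 'S01703']
```

With named fields, the two lists from the question become simple comprehensions over `entries`, and the parsing rules live in one place.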

I'd design a REST API that lets me query for trains easily and returns the response as JSON objects.

score 0 · Answer 3 · edited May 23 '17 at 12:16

First, you have to decode your bytes objects into str objects.

With the examples you provided:

examples = [
    b'2097 - MILANO CENTRALE|2097-S01700\n',
    b'123 - ROMA TERMINI|123-S01358\n',
    b'123 - TREVIGLIO|123-S01703\n'
]

Assuming that format is:

[TRAIN_NAME]|[TRAIN_NAME_REPEATED]-[TRAIN_NUMBER]\n

We don't need any regexes; we can simply split the entries on their delimiters:

A = []  # train names
B = []  # station codes

for example_bytes in examples:
    example = example_bytes.decode("utf-8").split("|")
    # first entry: example = ['2097 - MILANO CENTRALE', '2097-S01700\n']

    train_name = example[0]
    # first entry: train_name = '2097 - MILANO CENTRALE'

    train_number = example[1].split("-")[1]
    # first entry: train_number = 'S01700\n' (newline is stripped below)

    A.append(train_name)
    B.append(train_number.rstrip())

Then to see the result:

print(A)
# ['2097 - MILANO CENTRALE', '123 - ROMA TERMINI', '123 - TREVIGLIO']
print(B)
# ['S01700', 'S01358', 'S01703']

If you don't want your entries to be repeated (if that's even possible), I'd suggest using sets instead of lists.
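Note that a set does not keep insertion order, and de-duplicating A and B separately could leave them misaligned. As a hedged alternative, the sketch below drops repeated (name, code) pairs together while keeping first-seen order, relying on dicts preserving insertion order (Python 3.7+). The helper name `unique_pairs` is made up for illustration.

```python
def unique_pairs(names, codes):
    """Drop repeated (name, code) pairs, keeping first-seen order."""
    # dict keys are unique and keep insertion order
    pairs = dict.fromkeys(zip(names, codes))
    return [name for name, _ in pairs], [code for _, code in pairs]


A, B = unique_pairs(
    ["123 - ROMA TERMINI", "123 - TREVIGLIO", "123 - ROMA TERMINI"],
    ["S01358", "S01703", "S01358"],
)
print(A)  # ['123 - ROMA TERMINI', '123 - TREVIGLIO']
print(B)  # ['S01358', 'S01703']
```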

Check the API documentation, since you depend on the format it provides entries in.

score 0 · Answer 4 · Padraic Cunningham · 2016-07-10T13:33:11.487

You can actually get the data you want in JSON format by making the correct request. For example, for *Treno - Stazione*, using the station code for ROMETTA MESSINESE:

from pprint import pprint as pp
import requests
import datetime

station = "S12049"
dt = datetime.datetime.utcnow()
# Arrivals at the station
arrival = "http://www.viaggiatreno.it/viaggiatrenonew/resteasy/viaggiatreno/arrivi/{station}/{iso}"
with requests.Session() as s:
    r = s.get(arrival.format(station=station, iso=dt.strftime("%a %b %d %Y %H:%M:%S GMT+000 (UTC)")))
    pp(r.json())

And departure:

# Departures from the station
departure = "http://www.viaggiatreno.it/viaggiatrenonew/resteasy/viaggiatreno/partenze/{station}/{iso}"
with requests.Session() as s:
    r = s.get(departure.format(station=station, iso=dt.strftime("%a %b %d %Y %H:%M:%S GMT+000 (UTC)")))
    pp(r.json())

edited Jul 10 '16 at 13:33

answered Jul 10 '16 at 12:49


Nope. These are the arrivals at a station. I need information about the train. For that, I need station_of_departure_id and train_number. Using the train_number and the method above, I can get the ID of the departure station. – MarcoBuster Jul 10 '16 at 12:52
@MarcoBuster, this is one example. All the data is requested using AJAX requests and comes back in JSON format, so all of it can be retrieved the same way. What URL were you using to get the train? – Padraic Cunningham Jul 10 '16 at 12:54
@MarcoBuster, I added both arrival and departure, which contain all the information you see on the page, and more, in JSON format. – Padraic Cunningham Jul 10 '16 at 13:20
