
Hello, world! I'm having some regex trouble. I'm using an HTTP API (for searching Italian trains) that gives me information like this:

10911 - SESTO S. GIOVANNI|10911-S01325

Format:

TRAIN_NUMBER - STATION|TRAIN_NUMBER-STATION_CODE

As long as there were few requests, everything was fine, because I only need one piece of information, "S01325". But as the number of users grew, I discovered that two trains can share the same number. For example, the number 612 can refer to two different trains; the API gives me:

612 - TARANTO|612-S11465
612 - ASSO|612-N00079

When I read this using the urllib.request module, I get:

b'612 - TARANTO|612-S11465\n612 - ASSO|612-N00079\n'

I need two list variables:

A = ['612 - TARANTO', '612 - ASSO'] #First regex expression
B = ['S11465', 'N00079'] #Second regex expression

Do I have to use regex? I have never used regex, so I don't know what to do. I searched Google and the wikis/docs, but I couldn't find a solution to this problem. Obviously the regex must work for all cases, for example:

b'2097 - MILANO CENTRALE|2097-S01700\n'

should give me:

A = ['2097 - MILANO CENTRALE']
B = ['S01700']

Another example:

b'123 - ROMA TERMINI|123-S01358\n123 - TREVIGLIO|123-S01703\n'

should give me:

A = ['123 - ROMA TERMINI', '123 - TREVIGLIO']
B = ['S01358','S01703']

Thanks very much for reading. I hope I was clear. Have a good day, Marco. P.S.: Link to the Italian docs

MarcoBuster

4 Answers


You don't need regular expressions, actually. You can use them though. There's a rather simple pattern in your information:

<Train number> - <city>|<Train number>-<identifier>

So let's look at what happens if you do

>>> '123 - ROMA TERMINI|123-S01358'.split('|', 1)
['123 - ROMA TERMINI', '123-S01358']

So now you have the first part of what you want. The second part can be extracted in a similar way. Let's look at

>>> '123-S01358'.split('-', 1)
['123', 'S01358']

So you can do

>>> '123-S01358'.split('-', 1)[-1]
'S01358'

And you're done!

If you combine all of this together you should get your answer.
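As a rough sketch of how the pieces above might be combined, the following assumes the response is UTF-8 bytes with one entry per line, as in the question. The function names `parse_with_split` and `parse_with_regex` are made up for illustration, and the regex variant additionally assumes the train number after the `|` is all digits and the station code is alphanumeric.

```python
import re


def parse_with_split(raw: bytes):
    """Build the two lists using only str.split, as in the steps above."""
    names, codes = [], []
    for line in raw.decode("utf-8").splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # '612 - TARANTO|612-S11465' -> '612 - TARANTO', '612-S11465'
        name, rest = line.split("|", 1)
        names.append(name)
        # '612-S11465' -> 'S11465'
        codes.append(rest.split("-", 1)[-1])
    return names, codes


# Assumed shape of a line: <anything>|<digits>-<alphanumeric code>
LINE_RE = re.compile(r"^(.+?)\|\d+-(\w+)$", re.MULTILINE)


def parse_with_regex(raw: bytes):
    """Same result with a regular expression, if one is preferred."""
    matches = LINE_RE.findall(raw.decode("utf-8"))
    names = [name for name, _ in matches]
    codes = [code for _, code in matches]
    return names, codes


A, B = parse_with_split(b"612 - TARANTO|612-S11465\n612 - ASSO|612-N00079\n")
```

Both functions return the lists in the order the API sent them, so `A[i]` and `B[i]` always describe the same line.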

Ian Stapleton Cordasco

I must use REGEX, true?

Not true.

I think a better solution is to parse each line into tokens and assign them to sensible variables. You need a solution that is less about string primitives and regex; more about objects and encapsulation.
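To make that idea concrete, here is one possible sketch, assuming the line format shown in the question. The class `TrainEntry`, its field names, and the helper `parse_response` are hypothetical names chosen for illustration, not part of any existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainEntry:
    """One line of the response, split into named fields (illustrative names)."""
    number: str
    station: str
    station_code: str

    @classmethod
    def from_line(cls, line: str) -> "TrainEntry":
        # '612 - TARANTO|612-S11465' -> label='612 - TARANTO', tail='612-S11465'
        label, tail = line.strip().split("|", 1)
        number, station = label.split(" - ", 1)
        station_code = tail.split("-", 1)[1]
        return cls(number, station, station_code)


def parse_response(raw: bytes):
    """Decode the raw bytes and return one TrainEntry per non-empty line."""
    text = raw.decode("utf-8")
    return [TrainEntry.from_line(line) for line in text.splitlines() if line.strip()]


entries = parse_response(b"612 - TARANTO|612-S11465\n612 - ASSO|612-N00079\n")
codes = [entry.station_code for entry in entries]
```

With objects like these, the calling code asks for `entry.station_code` rather than indexing into parallel lists.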

I'd design a REST API that let me query for trains easily and return the response as JSON objects.

duffymo

First, you have to convert your bytes objects to str objects.

With the examples you provided:

examples = [
    b'2097 - MILANO CENTRALE|2097-S01700\n',
    b'123 - ROMA TERMINI|123-S01358\n',
    b'123 - TREVIGLIO|123-S01703\n'
]

Assuming the format is:

[TRAIN_NAME]|[TRAIN_NAME_REPEATED]-[TRAIN_NUMBER]\n

We don't need any regexes, we can simply split entries by delimiters:

A = []  # train names, e.g. '2097 - MILANO CENTRALE'
B = []  # codes, e.g. 'S01700'

for example_bytes in examples:
    example = example_bytes.decode("utf-8").split("|")
    # example = ['2097 - MILANO CENTRALE', '2097-S01700\n']

    train_name = example[0]
    # train_name = '2097 - MILANO CENTRALE'

    train_number = example[1].split("-")[1]
    # train_number = 'S01700\n' (the trailing newline is stripped below)

    A.append(train_name)
    B.append(train_number.rstrip())

Then to see the result:

print(A)
# ['2097 - MILANO CENTRALE', '123 - ROMA TERMINI', '123 - TREVIGLIO']
print(B)
# ['S01700', 'S01358', 'S01703']

If you don't want your entries to be repeated (if that's even possible), I'd suggest using sets instead of lists.
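One caveat with plain sets is that they do not preserve order, and deduplicating the two lists separately can break the pairing between a name and its code. A minimal sketch of an order-preserving alternative, assuming the two lists are aligned by index (the helper name `unique_pairs` is hypothetical):

```python
def unique_pairs(names, codes):
    """Drop repeated (name, code) pairs while keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    pairs = list(dict.fromkeys(zip(names, codes)))
    unique_names = [name for name, _ in pairs]
    unique_codes = [code for _, code in pairs]
    return unique_names, unique_codes


names, codes = unique_pairs(
    ["123 - ROMA TERMINI", "123 - TREVIGLIO", "123 - ROMA TERMINI"],
    ["S01358", "S01703", "S01358"],
)
```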

Check the API documentation, since you depend on the format it returns entries in.

Jezor

You can actually get the data you want in JSON format by making the correct request. For example, for "Treno - Stazione", the arrivals, using the station code for ROMETTA MESSINESE:

from pprint import pprint as pp
import requests
import datetime

station = "S12049"  # station code for ROMETTA MESSINESE
dt = datetime.datetime.utcnow()
# Arrivals ("arrivi") endpoint for the given station and time
arrival = "http://www.viaggiatreno.it/viaggiatrenonew/resteasy/viaggiatreno/arrivi/{station}/{iso}"
with requests.Session() as s:
    r = s.get(arrival.format(station=station, iso=dt.strftime("%a %b %d %Y %H:%M:%S GMT+000 (UTC)")))
    pp(r.json())

And departure:

# Departures ("partenze") endpoint; reuses station and dt from the snippet above
departure = "http://www.viaggiatreno.it/viaggiatrenonew/resteasy/viaggiatreno/partenze/{station}/{iso}"
with requests.Session() as s:
    r = s.get(departure.format(station=station, iso=dt.strftime("%a %b %d %Y %H:%M:%S GMT+000 (UTC)")))
    pp(r.json())
Padraic Cunningham
  • Nope. This is the arrivals of a station. I need the information of the train. For this, I need station_of_departure_id and train_number. Using the train_number and the method above, I can have the id of the station of departure. – MarcoBuster Jul 10 '16 at 12:52
  • @MarcoBuster, this is one example, all the data is requested using ajax requests where the data is in json format, all data can be retrieved the same way. What url were you using to get the train? – Padraic Cunningham Jul 10 '16 at 12:54
  • @MarcoBuster, I added both for arrival and departure which contains all the information you see on the page and more in json format – Padraic Cunningham Jul 10 '16 at 13:20