Splitting lines of a text file into part of a tuple-Python

Question

I'm working on an assembler program and chose to use Python over C (mostly because of what Python can do with lists, and because I wanted to learn it).

My question is how do I split each line of a text file into part of a tuple?

E.g., the test file is:

ADD R1,R2;  
OR R1,R3;

and I'd like code that parses it into this:

UserProgram=[['ADD','R1','R2'],['OR','R1','R3']]

It would also have to ignore comments after the semicolon. Thanks!

Coming from a C background, Python seems a little strange. Predictably enough, I tried using for loops to split each element (a single line) of the list. I also tried splitting on multiple delimiters but could not get that to run either. The program I had to do before this was a 5-stage pipelined architecture simulator. I wish I knew Python better, because it seems it would have lent itself better to that. – Phillip Tatti, Apr 09 '12 at 16:04

He Shiming · Answer 1 · 2012-04-10T00:20:47.843

2

>>> s = "ADD R1,R2; OR R1,R3;"
>>> t1 = s.split(';')
>>> t1
['ADD R1,R2', ' OR R1,R3', '']
>>> UserProgram = [t.strip().replace(',', ' ').split(' ') for t in t1 if len(t) > 0]
>>> UserProgram
[['ADD', 'R1', 'R2'], ['OR', 'R1', 'R3']]
>>>

By the way, square brackets indicate lists, not tuples.
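If tuples really are what is wanted, the inner lists can be converted afterwards. This is only a minimal sketch reusing the string from the session above; the names `rows` and `as_tuples` are made up for illustration and do not appear in the original answer.

```python
s = "ADD R1,R2; OR R1,R3;"

# Same parsing as in the session above: the result is a list of lists.
rows = [t.strip().replace(',', ' ').split(' ') for t in s.split(';') if len(t) > 0]

# Convert each inner list, and the outer list, to an immutable tuple.
as_tuples = tuple(tuple(row) for row in rows)

print(rows)       # square brackets: lists
print(as_tuples)  # parentheses: tuples
```

Lists are usually the more convenient choice if the program will be modified after parsing; tuples make sense if each instruction should stay fixed.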

edited Apr 10 '12 at 00:20

answered Apr 09 '12 at 06:05


-1 This misses additional requirements (they may have been edited in later?) and building a list manually is clumsy for simple cases like this. C programmers coming to Python should IMO be strongly encouraged to get away from their habits of thinking in terms of direct iteration. – Karl Knechtel Apr 10 '12 at 00:03
@KarlKnechtel, hi, I've edited it to use shortcut list creation. – He Shiming Apr 10 '12 at 00:21

San4ez · Answer 2 · 2012-04-10T06:00:22.397

1

>>> import re
>>> [re.split(r'\W+', s.strip()) for s in 'ADD R1,R2; OR R1,R3;'.split(';') if s]
[['ADD', 'R1', 'R2'], ['OR', 'R1', 'R3']]

UPD:

python -m timeit -s "import re; regexp = re.compile(r'\W+');" "[regexp.split(s.strip()) for s in 'ADD R1,R2; OR R1,R3;'.split(';') if s]"
100000 loops, best of 3: 3.34 usec per loop

python -m timeit "[t.strip().replace(',', ' ').split(' ') for t in 'ADD R1,R2; OR R1,R3;'.split(';') if t]"
100000 loops, best of 3: 2.1 usec per loop

By the way, my regex variant is not bad, although it is a little slower.
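The same comparison can also be run from inside Python with the `timeit` module. The following is a rough sketch, not a rigorous benchmark: the function names, the `NUMBER` constant, and the reduced loop count are my own choices, and the timings will differ from machine to machine.

```python
import re
import timeit

SOURCE = 'ADD R1,R2; OR R1,R3;'
NON_WORD = re.compile(r'\W+')   # precompiled, as in the command above
NUMBER = 10_000                 # loop count chosen arbitrarily for this sketch

def with_regex():
    return [NON_WORD.split(s.strip()) for s in SOURCE.split(';') if s]

def with_replace():
    return [t.strip().replace(',', ' ').split(' ') for t in SOURCE.split(';') if t]

# Both approaches should agree before their timings are compared.
assert with_regex() == with_replace()

for func in (with_regex, with_replace):
    seconds = timeit.timeit(func, number=NUMBER)
    print(f"{func.__name__}: {seconds / NUMBER * 1e6:.2f} usec per loop")
```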

edited Apr 10 '12 at 06:00

answered Apr 09 '12 at 06:09


Why would you use `re.split` to get default behaviour of `str.split` without an argument? That also doesn't require you to `strip()` the string first. – Karl Knechtel Apr 10 '12 at 00:05
Nope, `str.split` without an argument splits only on whitespace, not on commas. Everyone else calls `replace(',', ' ')` before splitting; I'm suggesting an alternative implementation using a regexp. – San4ez Apr 10 '12 at 05:44
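The difference the two commenters are discussing can be seen on a single line. This is a small illustrative sketch; the variable names and the `[,\s]+` pattern are my own and are not taken from either answer.

```python
import re

line = "ADD R1,R2"

# str.split() with no argument splits on runs of whitespace only,
# so the comma-separated operands stay glued together.
plain = line.split()
print(plain)        # ['ADD', 'R1,R2']

# Replacing commas with spaces first makes plain split() sufficient.
replaced = line.replace(',', ' ').split()
print(replaced)     # ['ADD', 'R1', 'R2']

# A regex can split on runs of commas and/or whitespace in one step.
by_regex = re.split(r'[,\s]+', line)
print(by_regex)     # ['ADD', 'R1', 'R2']
```

Unlike `str.split()` with no argument, `re.split` returns empty strings when the input begins or ends with a delimiter, which is why the regex answer calls `strip()` first.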

Abhijit · Answer 3 · 2012-04-09T06:22:10.483

If your source is in this format

source="""
ADD R1,R2;
OR R1,R3;
"""

then you can simply split the source line by line via splitlines(), and then split each line again with ';' as the delimiter, discarding anything after the ';':

sourcelines=[x.split(";")[0].replace(',',' ').split() 
             for x in source.splitlines() if x]
[['ADD', 'R1', 'R2'], ['OR', 'R1', 'R3']]

You can also go a step further and split each assembly source line into its opcode and its individual operands.

[[token.split(',') for token in x.split(";")[0].split()] 
  for x in source.splitlines() if x]

You would get something like

[[['ADD'], ['R1', 'R2']], [['OR'], ['R1', 'R3']]]
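If the nested single-element lists are awkward to work with, one possible variation is to store each instruction as an (opcode, operands) pair. This is a sketch under my own assumptions: the `instructions` name and the pair layout are hypothetical, not part of the answer above.

```python
source = """
ADD R1,R2;
OR R1,R3;
"""

instructions = []
for line in source.splitlines():
    code = line.split(';')[0].strip()    # drop the comment, if any
    if not code:
        continue                         # skip blank or comment-only lines
    parts = code.split(None, 1)          # opcode, then the rest of the line
    opcode = parts[0]
    rest = parts[1] if len(parts) > 1 else ''
    # Operands are comma-separated; ignore empty pieces and stray spaces.
    operands = tuple(op.strip() for op in rest.split(',') if op.strip())
    instructions.append((opcode, operands))

print(instructions)
# [('ADD', ('R1', 'R2')), ('OR', ('R1', 'R3'))]
```

Each entry can then be unpacked directly, for example `for opcode, operands in instructions: ...`.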

score 1 · Answer 4 · answered Apr 10 '12 at 00:00

So we have a source file in that format.

We want a list of tokens for each line in the file.

The tokens are the result of chopping off everything after the first semicolon, and splitting up the rest on either comma or whitespace. We can do that by replacing commas with spaces, and then just splitting on whitespace.

So we turn to the standard library. The split method of strings splits on whitespace when you don't give it a separator to split on. The replace method lets us replace one substring with another (for example, ',' with ' '). To remove everything after a semicolon, we can partition the line and take the first part (element 0 of the result).* The processing for an individual line thus looks like

line.partition(';')[0].replace(',', ' ').split()

and then we simply do this for each line of the file. To get a list of results of applying some function to elements of a source, we can ask for it directly, using a list comprehension (where basically we describe what the resulting list should look like). A file object in Python is a valid source of lines; you can iterate over it (this concept is probably more familiar to C++ programmers) and the elements are lines of the file.

So all we need to do is open the file (we'll idiomatically use a with block to manage the file) and produce the list:

with open('asm.s') as source:
    parsed = [
        line.partition(';')[0].replace(',', ' ').split()
        for line in source
    ]

Done.

* or use split again, but I find this is less clear when it's not actually your goal to produce a list of elements.
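One detail worth noting is that the comprehension above yields an empty list for every blank or comment-only line. If those should be skipped, a small wrapper can filter them out. This is a sketch only: `parse_asm` is a hypothetical helper name, and `io.StringIO` stands in for a real file so the example is self-contained.

```python
import io

def parse_asm(lines):
    """Return a list of token lists, one per non-empty instruction line."""
    parsed = []
    for line in lines:
        tokens = line.partition(';')[0].replace(',', ' ').split()
        if tokens:               # skip blank and comment-only lines
            parsed.append(tokens)
    return parsed

# io.StringIO behaves like an open text file here; with a real file you
# would write: with open('asm.s') as source: program = parse_asm(source)
sample = io.StringIO(
    "ADD R1,R2;  add the registers\n"
    "\n"
    "; a comment-only line\n"
    "OR R1,R3;\n"
)
program = parse_asm(sample)
print(program)  # [['ADD', 'R1', 'R2'], ['OR', 'R1', 'R3']]
```

Because `parse_asm` accepts any iterable of lines, it works equally well on a file object, a list of strings, or the result of `splitlines()`.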

ChessMaster · Answer 5 · 2012-04-10T05:56:57.187

0

>>> s = "ADD R1,R2; OR R1,R3;"
>>> [substr.split() for substr in s.replace(',',' ').split(';')[:-1]]
[['ADD', 'R1', 'R2'], ['OR', 'R1', 'R3']]

edited Apr 10 '12 at 05:56

answered Apr 09 '12 at 06:55


An assembly comment starts at the first semicolon of the line, not the last; you should use `[0]` instead of `[:-1]` here. – Karl Knechtel Apr 10 '12 at 00:06
When you split on `;`, everything before the first semicolon is what you actually want. In your example string, `OR R1,R3;` is actually a comment because it appears after the `;`. A `;` in assembly works like `#` in Python, so each instruction needs to be on its own line. – Wayne Werner Jul 25 '12 at 13:52
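The point made in these two comments is easiest to see on a line whose comment itself contains a semicolon. The sample line below is invented for illustration.

```python
line = "ADD R1,R2; adds R2 to R1; result stays in R1"

# Everything before the FIRST semicolon is code; the rest is comment.
code = line.split(';')[0]
tokens = code.replace(',', ' ').split()
print(tokens)
# ['ADD', 'R1', 'R2']

# Slicing with [:-1] only drops the text after the LAST semicolon,
# so part of the comment is kept as if it were code.
kept = line.split(';')[:-1]
print(kept)
# ['ADD R1,R2', ' adds R2 to R1']
```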
