
I can't figure out why the code below won't assign a new value to the nth row inside the for loop. As far as I know, the way I index the b matrix should be correct, but it seems like the count variable doesn't update on each iteration. The print statements are only there to check what is going on.

I assume it's pretty simple, so I would highly appreciate it if someone could point out where I'm going wrong.

#!/usr/bin/python
import sys
#from string import maketrans
#import re
import numpy as np

lines = sum(1 for line in sys.stdin)
b = np.zeros((lines,2))

count = 0
for line in sys.stdin:

    line = line.strip()
    myline = line.split(",")

    Depart = myline[3]
    DepartDelay = float(myline[6])  

    if DepartDelay<0:

        DepartDelay=0

    b[count,0] = Depart
    b[count, 1] = DepartDelay

    count = count + 1
    print(count)
print(b)
print(count)    

I run the code with the following command in an Ubuntu terminal:

cat sample.txt | mapper.py

which is why no data/text file is specified in the code itself.

In advance, thank you!

Kristian Nielsen

2 Answers


sys.stdin is basically a file object. It's just my opinion, but I would avoid writing for line in sys.stdin (it works, but I prefer reading the contents explicitly). You also have other issues, covered below.

I would prefer to call the read() or readlines() methods on sys.stdin to read the contents.

Get familiar with the basics of stdin here: How do you read from stdin in Python?

A loop that would work for you should look like this:

lines = sys.stdin.readlines()
for line in lines:
    do_something(line)

But be careful: if you iterate over the entire stdin at the start of your program (as you did with lines = sum(1 for line in sys.stdin)), the stream is exhausted and you can't simply start iterating over it again.
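
To see the effect without a terminal, here is a minimal sketch that uses io.StringIO as a stand-in for sys.stdin. The stand-in and the two sample rows are assumptions made for illustration, not part of the original program:

```python
import io

# Stand-in for sys.stdin; the two rows are invented sample data.
stream = io.StringIO("a,b\nc,d\n")

first_pass = sum(1 for line in stream)   # consumes every line
second_pass = sum(1 for line in stream)  # nothing is left to read

print(first_pass, second_pass)
```

The second pass sees no lines at all, which matches a for loop whose body never runs.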

A simpler approach is to read all the lines as shown above. If you need the number of lines, you can then get it like this:

count_lines = len(lines)

To summarize, the start of your program should look like this:

lines = sys.stdin.readlines()  # read stdin once; it cannot be read again
b = []
for line in lines:
    # do_something is a placeholder for your own parsing of one line
    Depart, DepartDelay = do_something(line)
    b.append([Depart, DepartDelay])

EDIT: I wouldn't use numpy at all for such a simple problem with multiple types to store (float and string).
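
As a rough sketch of the list-based approach, here is one possible way to fill in the do_something placeholder, using the parsing from the question. io.StringIO stands in for sys.stdin and the sample rows are invented, so treat both as assumptions:

```python
import io

# Stand-in for sys.stdin; rows are invented, only columns 3 and 6 matter here.
stream = io.StringIO(
    "x,x,x,0930,x,x,-5.0\n"
    "x,x,x,1415,x,x,12.5\n"
)

b = []
for line in stream:
    fields = line.strip().split(",")
    depart = fields[3]                         # kept as a string
    depart_delay = max(float(fields[6]), 0.0)  # negative delays become 0
    b.append([depart, depart_delay])

# "First column" of a list of rows: take element 0 of every row.
first_column = [row[0] for row in b]
print(first_column)
```

A plain list can hold a string and a float side by side in each row, and a list comprehension is enough to pull out a single column.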

Ofer Sadan
  • Thank you for your reply! I will implement your modifications to make my code smoother. But could you please let me know how this would solve my initial problem? – Kristian Nielsen Jun 05 '17 at 11:05
  • sure, but you have to be more specific - give an example of the desired output, and what do you get instead? is there an error or simply unexpected behavior? define what your problem is if you want a solution – Ofer Sadan Jun 05 '17 at 11:08
  • You can absolutely [iterate over text files](https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects), and it's encouraged for both clarity and efficiency (readlines forces loading the entire file into memory). The only thing to keep in mind is that they're iterable, not sequences, so you may not have a chance to rewind or reread. – Yann Vernier Jun 05 '17 at 11:31
  • @YannVernier tnx for the corrections, edited them into my answer – Ofer Sadan Jun 05 '17 at 11:35
  • Ofer Sadan, I actually thought that I had specified my problem, but reading my initial post once again I can see that I wasn't that specific. My problem is that nothing gets appended to the b matrix inside the for loop. I will try to implement @YannVernier's solution since that one seems most logical to me and, as I see it, it handles some of the same problems that you mentioned. I will let you know when I have tried to implement your solutions. – Kristian Nielsen Jun 05 '17 at 12:23
  • I have now tried to implement both of your solutions. Ofer Sadan, your solution works fine, except that I'm not allowed to assign elements of type string to the matrix b, because a matrix created with numpy has to contain elements of the same type, as far as I understand. @YannVernier, I seem to hit the same problem when I try your solution. Do either of you have thoughts on how I can get around this problem? Does it imply that numpy isn't suited for this task? – Kristian Nielsen Jun 05 '17 at 12:41
  • But thinking about your problem, you also have a float to store in the array... I would avoid numpy for such a simple problem; just use a regular empty list like `b=[]` and append to it with `mylist.append([Depart,DepartDelay])` – Ofer Sadan Jun 05 '17 at 12:56
  • If I use a regular empty list, shouldn't it be 'b' instead of 'mylist' in the append call? And how do I ensure that it gets a new row each time, since I'm not able to write b[count].append([...]) and then update count on each iteration, right? – Kristian Nielsen Jun 05 '17 at 13:04
  • Thank you for taking the time to help me with this, I think it works now. I will get on with my task to see if it works in the end. A last question: is it possible to print, for a list, what would be the first column in a matrix? – Kristian Nielsen Jun 05 '17 at 13:12
  • Yes, with a second loop through b – Ofer Sadan Jun 05 '17 at 13:16

The core problem seems to be that you're reading from sys.stdin twice: once in the argument to sum, which reads the entire input, and then again in the for loop. Because files have a current position, the usual result is that the for loop gets nothing to process. stdin is also likely to be a stream, so it cannot be rewound. You must load the data only once.

A second question is whether you can load the data at a higher level of abstraction. It looks like you're reading a CSV, for which csv.reader might be useful, but since the data ends up in a numpy array, numpy.loadtxt is even more appealing. It even has a usecols parameter to read specific columns.
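
For the csv.reader route, a minimal sketch could look like the following. io.StringIO is used as a stand-in for sys.stdin and the sample rows are invented, so both are assumptions:

```python
import csv
import io

# Stand-in for sys.stdin; rows are invented sample data.
stream = io.StringIO(
    "x,x,x,0930,x,x,-5.0\n"
    "x,x,x,1415,x,x,12.5\n"
)

rows = []
for fields in csv.reader(stream):  # each row arrives already split on commas
    rows.append((fields[3], max(float(fields[6]), 0.0)))

print(rows)
```

Compared with splitting on "," by hand, csv.reader also copes with quoted fields that contain commas.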

The count variable can also be handled a little more easily, using for count, line in enumerate(sys.stdin):. This will increment it along with the reading of lines.
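
A small sketch of the enumerate pattern, again with io.StringIO standing in for sys.stdin and invented sample rows (both assumptions):

```python
import io

# Stand-in for sys.stdin; rows are invented sample data.
stream = io.StringIO(
    "x,x,x,0930,x,x,-5.0\n"
    "x,x,x,1415,x,x,12.5\n"
)

delays = {}
for count, line in enumerate(stream):  # count starts at 0 and grows per line
    fields = line.strip().split(",")
    delays[count] = max(float(fields[6]), 0.0)  # row index -> clamped delay

print(delays)
```

No separate count = count + 1 statement is needed, so the counter cannot drift out of step with the lines read.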

I think a decent starting point is something like:

b = np.loadtxt(sys.stdin, delimiter=',', usecols=(3,6))
b[:,1] = np.maximum(b[:,1], 0)   # Set no lower than 0
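
To try this without piping a file in, here is a sketch that feeds np.loadtxt from io.StringIO instead of sys.stdin. The sample rows are invented, and it assumes that columns 3 and 6 are both numeric; with the default settings loadtxt converts every selected column to float, so a text value in the departure column would fail to convert:

```python
import io
import numpy as np

# Stand-in for sys.stdin; invented rows whose columns 3 and 6 are numeric.
stream = io.StringIO(
    "0,0,0,930,0,0,-5.0\n"
    "0,0,0,1415,0,0,12.5\n"
)

b = np.loadtxt(stream, delimiter=",", usecols=(3, 6))
b[:, 1] = np.maximum(b[:, 1], 0)  # clamp negative delays to 0

print(b)
```
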
Yann Vernier