I want to read two different files from sys.stdin. I can read the input and print it, but there is no separation between the first and the second file.

When I run the command below in cmd on Windows 10 (Python 3.6):

D:\digit>cat s.csv s2.csv

Result is:

1
2
3
4
5
1
2
3
4
5
6
7

I can print both files.

My Python code is:

import sys 
import numpy as np

train=[]
test=[]

# Assume the code below is function 1, which must read only s.csv
reader = sys.stdin.readlines()
for row in reader:          
    train.append(int(row[0]))
train = np.array(train)

print(train)

# I need something here to separate the two files
#sys.stdin.close()
#sys.stdin = sys.__stdin__ 
#sys.stdout.flush() 

# Assume the code below is function 2, which must read only s2.csv
reader = sys.stdin.readlines()
for row in reader:          
    test.append(int(row[0]))
test = np.array(test)

print(test)

I run the command below at the cmd prompt:

D:\digit>cat s.csv s2.csv | python pytest.py

Result is:

[1 2 3 4 5 1 2 3 4 5 6 7]
[]

Do I need to reset sys.stdin for the next file? I tried the following, but none of them worked:

sys.stdin.close()
sys.stdin = sys.__stdin__ 
sys.stdout.flush() 
halfer
Mahsa Hassankashi
  • What are the contents of the two files? – FlyingTeller Feb 26 '18 at 10:13
  • When you use `cat`, you send both the given files to `stdout`, and then you pipe that to `stdin` in your Python script. They are not separate: you are getting them both, concatenated. Your list `[1 2 3 4 5 1 2 3 4 5 6 7]` contains the contents of _both_ files. – khelwood Feb 26 '18 at 10:14
  • @FlyingTeller They are two files, s.csv and s2.csv, as I've shown. s.csv has one column [1 2 3 4 5] and s2.csv has [1 2 3 4 5 6 7]. – Mahsa Hassankashi Feb 26 '18 at 10:16
  • @khelwood I did not understand what you meant. – Mahsa Hassankashi Feb 26 '18 at 10:18
  • `cat` concatenates the files. Your Python script receives the contents of both files, *concatenated, not separate*. – khelwood Feb 26 '18 at 10:19
  • @khelwood please please first read carefully my question, then edit or answer!! I asked at the end how can I reset sys.stdin to separate these files. – Mahsa Hassankashi Feb 26 '18 at 10:33
  • You can't. If you want the files separate, then don't concatenate them. You can read files inside your Python script. They don't need to go through stdin. – khelwood Feb 26 '18 at 10:37
  • @khelwood I want to feed two files to the Python script via sys.stdin, so what instruction can separate them? – Mahsa Hassankashi Feb 26 '18 at 10:40

2 Answers

Let me try to explain.

d:\digit>cat s.csv s2.csv

has only one output, not two. It 'streams' the content of the first file to stdout and then 'streams' the content of the second file to stdout, without any pause or separator.

So there is only one 'stream' of output, which you then redirect with | to your Python script:

| python pytest.py

So pytest.py receives one 'stream' of input; it has no way of knowing where one file ends and the next begins.
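This is also why the second readlines() call in the question returns an empty list: the first call consumes the stream up to end of input. A minimal sketch, using io.StringIO as a stand-in for the piped stdin (an assumption made only so the example runs without a pipe):

```python
import io

# io.StringIO stands in for sys.stdin here (assumption for the demo).
# The text is what cat would send: s.csv immediately followed by s2.csv.
stream = io.StringIO("1\n2\n3\n1\n2\n")

first = stream.readlines()   # reads everything up to end of input
second = stream.readlines()  # nothing is left, so this is empty

print(len(first))  # number of lines from both files together
print(second)      # prints []
```

Closing or reassigning sys.stdin does not bring the data back, because the pipe carries no marker between the two files.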

If you want pytest.py to process the files separately, you can do the following:

D:\digit>cat s.csv | python pytest.py # process the first file
D:\digit>cat s2.csv | python pytest.py # process the second file

or as a one-liner:

D:\digit>cat s.csv | python pytest.py && cat s2.csv | python pytest.py

Just remember that pytest.py actually runs twice, so you need to adapt your Python script for this.

But while you are editing your python script...

What you should do: if you want both files in pytest.py, write some code to read both files in your Python script. If the data is CSV-structured, have a look at the csv module for reading and writing CSV files.
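A minimal sketch of that approach, assuming the two paths are passed as command-line arguments (for example python pytest.py s.csv s2.csv). The helper name read_first_column and the argument order are hypothetical, not part of the question:

```python
import csv
import sys


def read_first_column(path):
    """Return the first column of a CSV file as a list of ints."""
    with open(path, newline="") as f:
        # csv.reader yields an empty list for blank lines; skip those
        return [int(row[0]) for row in csv.reader(f) if row]


def main(argv):
    # Assumption: argv[1] is the train file and argv[2] is the test file
    train = read_first_column(argv[1])
    test = read_first_column(argv[2])
    print(train)
    print(test)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Each file is opened on its own, so there is no need to find a boundary inside a single stream.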

[EDIT based on comment:]

I could read multiple files it by pandas "pd.read_csv" , but my problem is how can I do it by sys.stdin?

You should really question why you are so focused on using stdin. Reading the files from within the Python script is likely to be much more effective.

If you must use stdin, you can add headers, footers, or separators to the stream; these have to be produced outside Python. Once you have defined them and can emit them, you can change the Python code to behave differently depending on which header, footer, or separator it receives from stdin.

This all sounds a bit complex and error-prone. I would strongly advise you to reconsider using stdin as the input for your script. Alternatively, please update your question with the technical requirements and limitations that restrict you to stdin.

[EDIT based on comment:]

I want to load these files in the Hadoop ecosystem, and I am using Hadoop streaming for that

Somehow, you need to "signal" to your Python script that it is processing a new file with new information.

Suppose you have two files. The first line of each needs to be some sort of "header" that identifies the file and determines which function runs on the remainder of the data, until a new "header" is received.

So let's say that your "train" data is prefixed with the line @is_train@ and your "test" data is prefixed with the line @is_test@.

How you do that in your environment is outside the scope of this question.

Now the redirection to stdin will send these two headers before the data, and you can have Python check for them. For example:

import sys 
import numpy as np

train=[]
test=[]

is_train = False
is_test = False

while True:
    line = sys.stdin.readline()
    # readline() returns '' at end of input; stop even if no @stop@ arrived
    if not line or '@stop@' in line:
        break
    if '@is_train@' in line:
        is_train = True
        is_test = False
        continue
    if '@is_test@' in line:
        is_train = False
        is_test = True
        continue
    if not line.strip():
        continue  # skip blank lines, which int() cannot parse
    #if this is csv data, you might want to split on ,
    line = line.split(',')
    if is_train:
        train.append(int(line[0]))
    if is_test:
        test.append(int(line[0]))

test = np.array(test)
train = np.array(train)

print(train)
print(test)

As the code shows, you also need a "footer" to mark where the data ends; in this example @stop@ is chosen.

One way of sending headers and footers could be:

D:\digit>cat is_train.txt s.csv is_test.txt s2.csv stop.txt | python pytest.py

where the three extra files each contain just the appropriate header or footer.
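If each file should go through its own function, the same marker idea can be arranged as a small parser that groups the values by header first. This is only a sketch: split_sections, handle_train and handle_test are hypothetical names, and io.StringIO stands in for sys.stdin so the example runs without a pipe.

```python
import io


def split_sections(stream, stop="@stop@"):
    """Group the first CSV column of each line under the last header seen."""
    sections = {}
    current = None
    for line in stream:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line == stop:
            break  # footer: end of all data
        if line.startswith("@") and line.endswith("@"):
            current = line.strip("@")  # "@is_train@" -> "is_train"
            sections[current] = []
            continue
        if current is not None:
            sections[current].append(int(line.split(",")[0]))
    return sections


def handle_train(values):
    # Placeholder for "function 1"
    return sum(values)


def handle_test(values):
    # Placeholder for "function 2"
    return len(values)


# Stand-in for: cat is_train.txt s.csv is_test.txt s2.csv stop.txt
demo = io.StringIO("@is_train@\n1\n2\n@is_test@\n7\n8\n9\n@stop@\n")
sections = split_sections(demo)
print(handle_train(sections["is_train"]))
print(handle_test(sections["is_test"]))
```

In the real script, split_sections(sys.stdin) would replace the StringIO stand-in.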

Edwin van Mierlo
  • Thank you for the answer, it seems better, but whichever command I run, both files go into the same function, if I assume that each read-and-print is a different function. The important problem is how to make each file go through its own function. Is there any other way, such as args = sys.stdin.readlines()[0], to distinguish between the inputs? – Mahsa Hassankashi Feb 26 '18 at 11:13
  • I could read multiple files with pandas "pd.read_csv", but my problem is how can I do it with sys.stdin? – Mahsa Hassankashi Feb 26 '18 at 11:19
  • I thought I had mentioned sys.stdin explicitly in several sentences and in the title. – Mahsa Hassankashi Feb 26 '18 at 11:29
  • Yes, but you don't explain why... and at this stage everyone has said the same thing. So now we wonder... why? If the offered solutions are not solving your problem, then we need more information to understand the restrictions you imply (e.g. must read 2 files using stdin). Update your question with this info, so we can understand and help. Help us to help you. – Edwin van Mierlo Feb 26 '18 at 11:41
  • Ah, OK, I found the confusion. I wanted to summarize and make my question simple to understand. I want to load these files in the Hadoop ecosystem and I am using Hadoop streaming for that (I know I can also use Pig, Hive or Kafka), but I want to test MapReduce with one Python file, which is only a mapper, and two files as train and test data for distribution across multiple machines in a Hadoop cluster. For more clarification please look at: https://stackoverflow.com/questions/48916243/python-hadoop-streaming-on-windows-script-not-a-valid-win32-application – Mahsa Hassankashi Feb 26 '18 at 12:16
  • AND https://stackoverflow.com/questions/48966769/python-hadoop-on-windows-cmd-one-mapper-and-multiple-inputs-error-subprocess – Mahsa Hassankashi Feb 26 '18 at 12:18
  • Still, I am sure that when asking a question it is better to confine the problem to a specific domain, to avoid debates about pros and cons. This helps both the asker and the answerers work more effectively, provided they read the question carefully. Otherwise I would have to tag many skills (such as Hadoop, MapReduce, etc.), which would discourage people from answering my question. – Mahsa Hassankashi Feb 26 '18 at 12:30
  • I wish that anyone who does not read the question carefully, @khelwood, would not downvote a question they have not read twice! Or at least explain why the question is -1. It can decrease Stack Overflow reputation badly. – Mahsa Hassankashi Feb 26 '18 at 12:30
  • Thank you so much, it seems to work and is a good idea. I am checking it in my IDE now. I will accept it. BTW, good job. – Mahsa Hassankashi Feb 26 '18 at 16:12
  • When I tested it I got [@ 4 5 @ 3 4 5 6 9 @]. Could I get [4 5] and [3 4 5 6 9] as separate arrays or matrices? – Mahsa Hassankashi Feb 26 '18 at 16:29
  • @MahsaHassankashi I tested the example code, and it works, it does end up in 2 lists, and no '@' characters. I suggest that you compare your code to the sample code, and figure out where your code goes wrong. – Edwin van Mierlo Feb 27 '18 at 09:39

Another solution is:

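import sys

# The first line of stdin holds the file names, separated by whitespace
args = sys.stdin.readlines()[0].replace("\"", "").split()

for arg in args:
    arg = arg.strip()
    with open(arg, "r") as f:
        train = []  # start a fresh list for each file
        for line in f:
            train.append(int(line))
        print(train)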

s.txt is:

1
2
3

s2.txt is:

7
8
9

D:\digit>echo s.txt s2.txt | python argpy.py
[1, 2, 3]
[7, 8, 9]

There are two key points:

  1. Use echo instead of cat, so that the file names are piped to the script rather than the concatenated file contents. Further reading: Difference between 'cat < file.txt' and 'echo < file.txt'

  2. Split the line read from stdin into file names, store them in args, and open each file in turn inside the for loop. See also: How to run code with sys.stdin as input on multiple text files
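The two points above can be sketched as a reusable function that keeps each file's values under its own name. This is a hedged variant, not the code from the answer: load_from_names, read_ints and the base_dir parameter are hypothetical, and io.StringIO plus a temporary directory stand in for the echo pipe and the real files.

```python
import io
import os
import tempfile


def read_ints(path):
    """Read one integer per non-blank line."""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]


def load_from_names(stream, base_dir="."):
    """Read whitespace-separated file names from stream and load each file."""
    names = stream.read().replace('"', "").split()
    return {name: read_ints(os.path.join(base_dir, name)) for name in names}


# Demo: create s.txt and s2.txt in a temporary directory (stand-ins for
# the real files), then pass their names the way echo would.
with tempfile.TemporaryDirectory() as tmp:
    for name, values in (("s.txt", [1, 2, 3]), ("s2.txt", [7, 8, 9])):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("\n".join(str(v) for v in values) + "\n")
    fake_stdin = io.StringIO("s.txt s2.txt\n")
    loaded = load_from_names(fake_stdin, base_dir=tmp)

print(loaded["s.txt"])
print(loaded["s2.txt"])
```

Because names are split on whitespace, this sketch does not handle file names that contain spaces.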

Happy because I've done it :)

Mahsa Hassankashi