In Perl, to lowercase a textfile, I could do the following lowercase.perl:

#!/usr/bin/env perl

use warnings;
use strict;

binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");

while(<STDIN>) {
  print lc($_);
}

And on the command line: perl lowercase.perl < infile.txt > lowered.txt

In Python, I could do the same with lowercase.py:

#!/usr/bin/env python
import io
import sys

with io.open(sys.argv[1], 'r', 'utf8') as fin:
    with io.open(sys.argv[2], 'r', 'utf8') as fout:
        fout.write(fin.read().lower())

And on the command line: python lowercase.py infile.txt lowered.txt

Is the Perl lowercase.perl different from the Python lowercase.py?

Does it stream the input and lowercase it as it outputs? Or does it read the whole file like the Python's lowercase.py?

Instead of reading in a whole file, is there a way to stream the input into Python and output the lowered case byte by byte or char by char?

Is there a way to control the command-line syntax such that it follows the Perl STDIN and STDOUT? E.g. python lowercase.py < infile.txt > lowered.txt?

alvas
  • The Perl program basically streams, yes. If you run it without params, the loop will wait for input from the command line until you end the program or send `EOF`. – simbabque Apr 25 '16 at 13:32
  • @alvas Python has `for line in sys.stdin:` which afaik will act the same as `while($line = <STDIN>)` – Andreas Louv Apr 25 '16 at 14:29

7 Answers

A Python 3.x equivalent of your Perl code might look as follows:

#!/usr/bin/env python3.4
import sys

for line in sys.stdin:
    # each line keeps its own newline, so suppress the one print() would add
    print(line.lower(), end="", file=sys.stdout)

It reads stdin line by line and can be used in a shell pipeline.
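The same loop can be wrapped in a function that accepts any pair of text streams, which makes it easy to try without a shell. This is only a sketch: the name lowercase_stream and the in-memory demo data are made up for illustration.

```python
import io


def lowercase_stream(fin, fout):
    """Copy fin to fout one line at a time, lowercasing each line."""
    for line in fin:
        # The line still carries its own newline (if it had one), so write it as-is.
        fout.write(line.lower())


# In-memory stand-ins for stdin and stdout; the last line has no trailing newline.
demo_in = io.StringIO("ÉCOLE\nStraße\nLAST LINE")
demo_out = io.StringIO()
lowercase_stream(demo_in, demo_out)
```

In a real script the call would presumably be lowercase_stream(sys.stdin, sys.stdout). On Python 3.7 or later, sys.stdin.reconfigure(encoding="utf-8") and sys.stdout.reconfigure(encoding="utf-8") can pin the encoding first, roughly what the binmode calls do in the Perl version.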

Denis Shatov

Slightly off topic (depending on your definition of "Perl") but maybe of interest...

perl6 -e  ' .lc.say for "infile.txt".IO.lines ' > lowered.txt

This neither processes "byte by byte" nor "whole file" but "line by line". .lines creates a lazy list so you will not use a ton of memory if your file is large. The file is presumed to be text (meaning you get Str's rather than Buf's of bytes when you read) and the encoding defaults to "Unicode" - meaning open will try to figure out what UTF is used and if it can't it will presume UTF-8. Details here.

By default, line endings are chomp'ed as you read and put back on by say. If the processing requirements prohibit that, you can pass the negated boolean named parameter :!chomp to .lines (and use .print rather than .say):

$ perl6 -e  ' .lc.print for "infile.txt".IO.lines(:!chomp) ' > lowered.txt

You can avoid the IO redirection and do it all in perl6, but this will read the whole file in as one Str:

$ perl6 -e  ' "lowered.txt".IO.spurt: "infile.txt".IO.slurp.lc '
Marty

There seem to be two interleaved issues here, and I address those first. For how to make both Perl and Python accept either invocation with very similar behavior, see the second part of the post.

Short: They differ in how they do I/O but both work line-by-line, and the Python code is easily changed to allow the same command-line invocation as the Perl code. Also, both can be written so as to allow input either from a file or from the standard input stream.


(1)   Both of your solutions are "streaming," in the sense that they both process input line-by-line. Perl code reads from STDIN while Python code gets data from a file, but they both get a line at a time. In that sense they are comparable in efficiency for large files.

A standard way to both read and write files line-by-line in Python is

with open('infile', 'r') as fin, open('outfile', 'w') as fout:
    for line in fin:                 # iterating a file object yields one line at a time
        fout.write(line.lower())

See, for example, these SO posts on processing a very large file and read-and-write files. The way you read the file seems idiomatic for line-by-line processing; see for example SO posts on reading large-file line-by-line, on idiomatic line-by-line reading and another one on line-by-line reading.

Change the first open here to your io.open to directly take the first argument from the command line as the file name, and add modes as needed.
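As a rough sketch of that change, the function below takes the two file names as parameters and copies line by line. The name lowercase_file and the temporary demo files are hypothetical. The encoding is passed by keyword because the third positional parameter of io.open is buffering, not encoding.

```python
import io
import os
import tempfile


def lowercase_file(src_path, dst_path, encoding="utf-8"):
    """Lowercase src_path into dst_path without loading the whole file."""
    with io.open(src_path, "r", encoding=encoding) as fin, \
         io.open(dst_path, "w", encoding=encoding) as fout:
        for line in fin:              # one line in memory at a time
            fout.write(line.lower())


# Demo on a throwaway directory so nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "infile.txt")
    dst = os.path.join(tmp, "lowered.txt")
    with io.open(src, "w", encoding="utf-8") as f:
        f.write("Hello WORLD\nÄÖÜ\n")

    lowercase_file(src, dst)

    with io.open(dst, "r", encoding="utf-8") as f:
        result = f.read()
```

In a script, the call would be something like lowercase_file(sys.argv[1], sys.argv[2]), matching the invocation python lowercase.py infile.txt lowered.txt.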

(2)   The command line with both input and output redirection that you show is a shell feature

./program < input > output

The program is fed lines through the standard input stream (file descriptor 0). They are provided from the file input by the shell via its < redirection. From the GNU Bash manual (see 3.6.1), where "word" stands for our "input":

Redirection of input causes the file whose name results from the expansion of word to be opened for reading on file descriptor n, or the standard input (file descriptor 0) if n is not specified.

Any program can be written to do that, i.e. act as a filter. For Python you can use

import sys
for line in sys.stdin:
    sys.stdout.write(line.lower())   # line already ends with a newline, so don't add one

See for example a post on writing filters. Now you can invoke it as script.py < input in a shell.

The code prints to standard output, which can then be redirected by shell using >. Then you get the same invocation as for the Perl script.

I take it that the standard output redirection > is clear in both cases.


Finally, you can bring both to nearly identical behavior, allowing either invocation, in the following way.

In Perl, there is the following idiom

while (my $line = <>) {
    # process $line
}

The diamond operator <> either takes line by line from all files submitted on the command line (which are found in @ARGV), or it gets its lines from STDIN (if data is somehow piped into the script). From I/O Operators in perlop

The null filehandle <> is special: it can be used to emulate the behavior of sed and awk, and any other Unix filter program that takes a list of filenames, doing the same to each line of input from all of them. Input from <> comes either from standard input, or from each file listed on the command line. Here's how it works: the first time <> is evaluated, the @ARGV array is checked, and if it is empty, $ARGV[0] is set to "-" , which when opened gives you standard input. The @ARGV array is then processed as a list of filenames.

In Python you get practically the same behavior by

import fileinput
for line in fileinput.input():
    pass  # process line

This also goes through the lines of the files named in sys.argv, defaulting to sys.stdin if the list is empty. From the fileinput documentation:

This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty. If a filename is '-', it is also replaced by sys.stdin. To specify an alternative list of filenames, pass it as the first argument to input(). A single file name is also allowed.

In both cases, if there are command-line arguments other than file names, more needs to be done.
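On the Python side, one possible way to do that extra work is to let argparse separate options from file names and hand only the file names to fileinput. The sketch below is an assumption-laden illustration: the --upper option and the function name run are invented here just to have a non-filename argument.

```python
import argparse
import fileinput
import io
import os
import tempfile


def run(argv, out):
    """Parse argv, then copy the named files (or stdin) to out, changing case."""
    parser = argparse.ArgumentParser(description="Change the case of text.")
    # --upper is a hypothetical option, included only to show a non-filename argument.
    parser.add_argument("--upper", action="store_true",
                        help="uppercase instead of lowercase")
    parser.add_argument("files", nargs="*",
                        help="input files; standard input is used if none are given")
    args = parser.parse_args(argv)

    convert = str.upper if args.upper else str.lower
    # Passing files= explicitly stops fileinput from treating --upper as a filename.
    with fileinput.input(files=args.files) as lines:
        for line in lines:
            out.write(convert(line))


# Demo with a temporary file and in-memory output streams.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "infile.txt")
    with open(path, "w") as f:
        f.write("Mixed Case\nSecond LINE\n")

    lowered, raised = io.StringIO(), io.StringIO()
    run([path], lowered)
    run(["--upper", path], raised)
```

A script would call run(sys.argv[1:], sys.stdout). With an empty file list, fileinput falls back to standard input, so the < input form keeps working.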

With this you can use both Perl and Python scripts in either way

lowercase < input > output
lowercase input   > output

Or, for that matter, as cat input | lowercase > output.


All methods here read input and write output line-by-line. This may be further optimized (buffered) by the interpreter, the system, and the shell's redirections. It is possible to change that so as to read and/or write in smaller chunks, but that would be extremely inefficient and would noticeably slow down programs.
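For completeness, reading in fixed-size chunks rather than lines might look like the sketch below, where a chunk size of 1 gives the character-by-character behavior asked about in the question. The function name lowercase_chunks and the default size are arbitrary choices for illustration.

```python
import io


def lowercase_chunks(fin, fout, chunk_size=4096):
    """Copy fin to fout in pieces of at most chunk_size characters, lowercased."""
    while True:
        chunk = fin.read(chunk_size)
        if not chunk:                 # an empty string signals end of input
            break
        fout.write(chunk.lower())


# chunk_size=1 processes one character at a time.
char_out = io.StringIO()
lowercase_chunks(io.StringIO("Hello\nWORLD"), char_out, chunk_size=1)

# A larger chunk produces the same text with far fewer read calls.
block_out = io.StringIO()
lowercase_chunks(io.StringIO("Hello\nWORLD"), block_out, chunk_size=3)
```

One caveat to keep in mind: lowercasing rules that depend on neighbouring characters (Greek final sigma is the usual example) may give different results when a chunk boundary falls inside a word, which is another reason to prefer lines.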

zdim

Is the Perl lowercase.perl different from the Python lowercase.py?

The Python file takes filenames for input and output. The Perl file does streaming (e.g. it may be used in some_command | your_perl_script.pl | some_other_command).

Does it stream the input and lowercase it as it outputs? Or does it read the whole file like the Python's lowercase.py?

while(<STDIN>) {

walks through your input line by line, as long as your input contains \n (the default line break, which may be changed by setting $/). This is streaming.

Instead of reading in a whole file, is there a way to stream the input into Python and output the lowered case byte by byte or char by char?

Probably yes, but I don't know Python :(

Sebastian

In the example, the only difference is how the data is accessed. One is by opening a file (the Python version), the other is by piping I/O to the program (the Perl version). Either language can access the data by either method.

Examples working with stdin/stdout in python:

jason

I see two questions here:

  1. how to lowercase text without reading in an entire file: Read it line by line
  2. how to handle commandline arguments and default to stdin if none: Use fileinput.

Here's how:

To lowercase text, just use fin.readline() or just iterate the file object (which reads one line at a time):

for line in fin:
    ...

To handle filenames specified on the command line, with stdin if none, use fileinput. If you'll just be sending everything to stdout, this will be enough:

import fileinput

for line in fileinput.input():
    print(line.lower(), end="")

But if you want to lowercase a large corpus and store the result to disk, it's likely that you'll want to output each file separately. That's a little bit more work, since fileinput won't automatically redirect your output. Here's one way:

import fileinput
import sys

currentname = None
for line in fileinput.input():
    if fileinput.isfirstline():
        if currentname and currentname != "<stdin>":  # clean up after previous file
            fout.close()

        currentname = fileinput.filename()        # Set up for new file
        if currentname == "<stdin>":
            fout = sys.stdout
        else:
            fout = open(currentname + "-low", "w")
    fout.write(line.lower())

if currentname and currentname != "<stdin>":      # close the last output file
    fout.close()

I wrote out each file <name> to <name>-low, but you can of course substitute any other approach (e.g., use the same name for output but in a different directory).
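The "same name in a different directory" variant could be sketched as follows. This version opens each input directly instead of going through fileinput, and the function name lowercase_into and the directory names in the demo are made up for the example.

```python
import io
import os
import tempfile


def lowercase_into(paths, out_dir, encoding="utf-8"):
    """Write a lowercased copy of each file in paths into out_dir, keeping its base name."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for path in paths:
        target = os.path.join(out_dir, os.path.basename(path))
        with io.open(path, "r", encoding=encoding) as fin, \
             io.open(target, "w", encoding=encoding) as fout:
            for line in fin:          # still one line at a time
                fout.write(line.lower())
        written.append(target)
    return written


# Demo in a temporary directory with two small input files.
with tempfile.TemporaryDirectory() as tmp:
    inputs = []
    for name, text in [("a.txt", "First FILE\n"), ("b.txt", "SECOND file\n")]:
        p = os.path.join(tmp, name)
        with io.open(p, "w", encoding="utf-8") as f:
            f.write(text)
        inputs.append(p)

    outputs = lowercase_into(inputs, os.path.join(tmp, "lowered"))
    names = [os.path.basename(p) for p in outputs]
    contents = []
    for p in outputs:
        with io.open(p, "r", encoding="utf-8") as f:
            contents.append(f.read())
```

Note that two inputs with the same base name from different directories would overwrite each other here, so a real corpus might need a different naming scheme.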

alexis

The Python program will attempt to read the whole input file. The call of read() without an argument will read till EOF, see the io module documentation.

There is a small bug as well: fout should be opened in "w" mode.

As mentioned by @denis-shatov, it is possible to write a Python script equivalent to the Perl one.

Mike Bessonov