
Is there a better way to use the `with open(file) as f: f.read()` pattern inside a for loop, for example a comprehension that operates on many files?

I am trying to put this into a DataFrame so that each file name maps to its contents.

Here is what I have, but it seems inefficient and not very Pythonic or readable:

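import glob
import numpy as np
import pandas as pd

documents = pd.DataFrame(glob.glob('*.txt'), columns=['files'])
documents['text'] = pd.Series(np.nan, index=documents.index, dtype=object)  # object dtype so it can hold strings
for txtfile in documents['files'].tolist():
    if txtfile.startswith('GSE'):
        with open(txtfile) as f:
            # .loc avoids chained assignment, which may write to a temporary copy
            documents.loc[documents['files'] == txtfile, 'text'] = f.read()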

output:

    files   text
0   GSE2640_GSM50721.txt    | RNA was extracted from lung tissue using a T...
1   GSE7002_GSM159771.txt   Array Type : Rat230_2 ; Amount to Core : 15 ; ...
2   GSE1560_GSM26799.txt    | C3H denotes C3H / HeJ mice whereas C57 denot...
3   GSE2171_GSM39147.txt    | HIV seropositive , samples used to test HIV ...
chase
  • In terms of readability, I don't see a problem here. It seems fairly clear what you're trying to accomplish, though I would add some documentation at the top of the file, class, or function stating the intended behavior in plain language. As for efficiency, I'm not sure there is a better approach; I haven't researched it. However, professors and more experienced programmers have told me **DON'T PRE-OPTIMIZE**, or, [Optimization is the root of *almost* all evil.](https://stackoverflow.com/questions/385506/when-is-optimisation-premature) – David Culbreth Jan 16 '19 at 22:01
  • Possible duplicate of [When is optimisation premature?](https://stackoverflow.com/questions/385506/when-is-optimisation-premature) – David Culbreth Jan 16 '19 at 22:05
  • @DavidCulbreth I was mainly seeing if there is something remarkably simple (as there usually is for python) like `{file:file.readstr() for file in filelist}` – chase Jan 16 '19 at 22:15
  • Ah. That makes sense. Given that `open(...)`, `glob(...)`, and `DataFrame()` are all involved, I don't think a one-liner would be attainable while still readable. Even if one exists, the explicit version is very likely more readable. Since you're going through 4? different kinds of structures, I think the straightforward approach you've already presented is very likely the most readable, and a one-liner would probably not be any faster or slower. – David Culbreth Jan 16 '19 at 22:23
  • This is perfectly Pythonic. What isn't Pythonic about it? Seems very readable to me, using common python idioms. You should always use a `with` statement to handle files. That is very Pythonic. – juanpa.arrivillaga Jan 16 '19 at 22:54
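
To illustrate the point about `with` made in the comments above, here is a minimal sketch showing that the file is closed once the block exits, even when an exception is raised inside it. The file name `example.txt`, the temporary directory, and both helper functions are hypothetical, chosen only for the demonstration.

```python
import tempfile
from pathlib import Path


def read_and_report(path):
    """Read a file inside a with block; return (contents, closed-after-block)."""
    with open(path) as f:
        text = f.read()
    # The with block has exited, so the file object is already closed here.
    return text, f.closed


def closed_after_error(path):
    """Raise inside the with block and report whether the file was closed anyway."""
    handle = None
    try:
        with open(path) as f:
            handle = f
            raise ValueError("simulated failure while processing")
    except ValueError:
        pass
    return handle.closed


# Hypothetical sample file, created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "example.txt"
    sample.write_text("hello")
    text, closed = read_and_report(sample)
    closed_on_error = closed_after_error(sample)
```

Both `closed` and `closed_on_error` end up `True`, which is the guarantee a bare `open(...).read()` does not spell out.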

2 Answers

2

Your code looks perfectly readable. Perhaps you were looking for something like this (Python 3 only):

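import glob
import pathlib

import pandas as pd

documents = pd.DataFrame(glob.glob('*.txt'), columns=['files'])
# Files that don't start with 'GSE' get False in 'text' (the short-circuit value of `and`)
documents['text'] = documents['files'].map(
    lambda fname: fname.startswith('GSE') and pathlib.Path(fname).read_text())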
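
As a variation, the dict-comprehension form asked about in the question's comments could be sketched roughly as follows. This is only one possible arrangement: the helper name `read_texts` and its defaults are assumptions made for illustration, and the `GSE` filter is moved into the glob pattern instead of being tested per file. `Path.read_text()` opens and closes the file itself, so no explicit `with` is needed.

```python
from pathlib import Path


def read_texts(directory=".", pattern="GSE*.txt"):
    """Map each matching file name to its full text contents.

    `read_texts` is a hypothetical helper; the pattern filters on the
    'GSE' prefix so only matching files are opened.
    """
    return {
        path.name: path.read_text()
        for path in sorted(Path(directory).glob(pattern))
    }
```

Assuming pandas is imported as `pd`, `pd.DataFrame(list(read_texts().items()), columns=['files', 'text'])` would then give the two-column table. Note that filtering in the glob pattern drops non-`GSE` files entirely instead of keeping them with an empty `text` value.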
Marat
  • Note that there is a [pathlib2](https://pypi.org/project/pathlib2/) package that backports `pathlib` to Python 2. – Andrew F Jan 17 '19 at 13:45
0

You can do:

# import libraries
import os

import pandas

# list filenames, assuming your path is './'
files = [name for name in os.listdir('./')
         if name.startswith('GSE') and name.endswith('.txt')]

# get contents of files
contents = []
for name in files:
    with open(name) as f:
        contents.append(f.read().strip())

# into a nice table
table = pandas.DataFrame(contents, index=files, columns=['text'])
aerijman
  • This doesn't use a context-manager. Also, this goes against PEP8 style guidelines by assigning a lambda to a name. Just use a full function definition. – juanpa.arrivillaga Jan 17 '19 at 00:04
  • Thank you @juanpa.arrivillaga. What happens if you assign a lambda to a name? You also helped me understand the question better, and I changed my suggestion accordingly. – aerijman Jan 17 '19 at 13:44
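
On the question of what happens when a lambda is assigned to a name: it works, but the function object keeps the generic name `<lambda>`, which is what tracebacks and reprs then show. That is the reason PEP 8 recommends a `def` statement in this case. A small sketch, with hypothetical function names:

```python
# Both callables behave the same; only their metadata differs.
strip_lambda = lambda text: text.strip()  # the style PEP 8 advises against


def strip_def(text):
    """Same behavior, but the function carries its own name."""
    return text.strip()


lambda_name = strip_lambda.__name__  # '<lambda>'
def_name = strip_def.__name__        # 'strip_def'
same_result = strip_lambda("  GSE  ") == strip_def("  GSE  ")
```

An error raised inside `strip_def` is reported under that name, while the lambda version only shows `<lambda>`.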