1

I use Python 3 and I want to read and print the first N rows of a .txt file (the file is 40 GB+, so I can't open it all at once because of RAM limitations). I just want to understand the file structure (columns, variable names, separators, ...). With the code below, Python gives me [] as output (instead of the printed lines I want):

from itertools import islice

with open("filename.txt") as myfile:
    head = list(islice(myfile, 1, 25))
print(head)

I also tried adding 'r' next to the file name, but that did not help. I only want to be able to read the first N rows (be it 25, 5, 10, or 15 rows, I don't care which).

The responses below pointed me to the .txt file itself (and not the Python code). I completely changed my approach and tried to read the first 100 rows using pd.read_csv as follows:

import pandas as pd

dfcontact2 = pd.read_csv('filename.txt', sep='|', names=['col1'], nrows=100)
dfcontact2.head(5)

The code outputs:

[screenshot of the resulting DataFrame]

where row 0 contains the variable names. I do not see any '\n' character at the end of each row, so I guess the file is not structured into lines, but why then is the output presented in rows? What am I missing?
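
One way to double-check which separators and line endings the file really contains is to peek at just the first few raw bytes without loading the whole 40 GB (a minimal sketch; the 2048-byte size is an arbitrary choice):

# open in binary mode and read only a small chunk from the start of the file
with open("filename.txt", "rb") as myfile:
    chunk = myfile.read(2048)  # reads at most 2048 bytes, not the whole file

# repr() makes separators such as b'|', b'\t', b'\r' and b'\n' visible
print(repr(chunk))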

Thanks a lot for your time. Best,

martins
  • Does this answer your question? [only reading first N rows of csv file with csv reader in python](https://stackoverflow.com/questions/50490257/only-reading-first-n-rows-of-csv-file-with-csv-reader-in-python) – Chris Jan 17 '21 at 00:48
  • So open it then `readline()` for those `n` lines. – DisappointedByUnaccountableMod Jan 17 '21 at 00:49
  • 5
    Are you sure there is more than one line in that file? BTW, `islice(myfile, 1, 25)` is not the first 25 lines, it is skipping line `0`. – zvone Jan 17 '21 at 01:08
  • 1
    The fix for what @zvone is mentioning on skipping line `0` is to just do `head = list(islice(myfile, 25))`. But it's clear that your file has at most one line in it; otherwise you'd get a non-empty `list`; your code is (mostly) fine, your data is bad. Are you sure you're reading the file you think you're reading? You might have the same file name in both the working directory and whatever you *think* is where it's reading from, and you're not reading from the one you expect. – ShadowRanger Jan 17 '21 at 01:41
  • @zvone Thanks. Seemingly there are no lines (i.e. no '\n' characters), though I don't understand why the output (see the new edits in the post) is offered in row format... – martins Jan 17 '21 at 18:22

2 Answers

4

Your code looks fine, so it's likely an issue with your file instead!

The file myfile.txt (now filename.txt)

  • does not have more than a single row of content (your logic skips the first line as zvone notes in a comment), so when read from index 1 (line 2), you'll find it evaluates as approximately

    >>> list(islice(["file line 1"], 1, 25))
    []
    

    further examples

    >>> list(islice(["file line 1"], 25))  # don't skip line 1
    ['file line 1']
    >>> list(islice(["file line 1", "file line 2"], 1, 25)) # multiple lines
    ['file line 2']
    
  • does exist (does not raise FileNotFoundError)

    >>> open("foo.missing")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    FileNotFoundError: [Errno 2] No such file or directory: 'foo.missing'
    

Testing, the code works for me when changed to read from index 0 instead of 1:

>>> with open("myfile2.txt", 'w') as fh:
...     for x in range(100):
...         fh.write("line {}\n".format(x))
...
[output clipped]
>>> from itertools import islice
>>> with open("myfile2.txt") as fh:
...     head = list(islice(fh, 25))
...
>>> head
['line 0\n', 'line 1\n', 'line 2\n', 'line 3\n', 'line 4\n', 'line 5\n', 'line 6\n', 'line 7\n', 'line 8\n', 'line 9\n', 'line 10\n', 'line 11\n', 'line 12\n', 'line 13\n', 'line 14\n', 'line 15\n', 'line 16\n', 'line 17\n', 'line 18\n', 'line 19\n', 'line 20\n', 'line 21\n', 'line 22\n', 'line 23\n', 'line 24\n']
ti7
  • Thanks @ti7! You and @zvone made me change the approach. I see the file is not structured in lines (no '\n' characters), though I wonder why the output is then given in row format... (edit in original post) – martins Jan 17 '21 at 18:23
  • @martins Excellent! With the update, it looks like there's at least some tab-separation too which may want consideration. When read into a dataframe by Pandas, it'll attempt to make rows out of the contents (presumably this is occurring at `|`). You may get good results setting tabs as the separator, or from the docs try out `sep=None` in the hopes that the (quite clever) `csv.Sniffer` can figure it out for you (see the sketch after these comments): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html – ti7 Jan 17 '21 at 20:39
  • `\t` is the tab separator (similar to `\n` for a new line) – ti7 Jan 17 '21 at 20:39
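
A minimal sketch of the `sep=None` suggestion from the comment above (the file name, the `dfguess` variable name and `nrows=100` are placeholder assumptions, not values from the post):

import pandas as pd

# sep=None asks pandas to sniff the delimiter with csv.Sniffer; that needs
# the Python parsing engine, so pass engine='python' explicitly
dfguess = pd.read_csv('filename.txt', sep=None, engine='python', nrows=100)
print(dfguess.columns)
print(dfguess.head())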
-1
head = []
with open("filename.txt") as myfile:
    for _ in range(25):
        head.append(myfile.readline())
print(head)

File I/O is buffered, so the entire file will not be in memory if it is too big. If you call readline, only a chunk of the file around that line is buffered.

If this still fails, I think the file is not a text file, or it does not contain '\n', so a single readline tries to read everything and exhausts the memory.
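
If the file really contains no '\n' at all, one possible safeguard (a sketch only; the 1000-character cap is an arbitrary assumption) is to pass a size limit to readline so a single call can never pull the whole file into memory:

head = []
with open("filename.txt") as myfile:
    for _ in range(25):
        # readline(size) returns at most `size` characters, so even a file
        # without any '\n' cannot exhaust memory here
        head.append(myfile.readline(1000))
print(head)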

ghchoi
  • This is worse than what the OP is doing on multiple levels (why remove the `with` statement?), and won't help if the OP's code doesn't work already. – ShadowRanger Jan 17 '21 at 01:40
  • @ShadowRanger Did he try `readline` already? Or does the file contain no '\n', or is it not a text file? How could `readline` not work...? – ghchoi Jan 17 '21 at 01:48
  • This has a few issues, but `.readline()` is probably fine; notably, it leaks an open file handle (`f.close()` is never called explicitly, nor via a `with` context manager, which would do it for you) and the file-like returned by `open` is already iterable! – ti7 Jan 17 '21 at 01:51