
I have a data file in which I need certain lines skipped.

(1 1),skip,this
skip,this,too
1,2,3
4,5,6
7,8,9
10,11,12
(1 2),skip,this
skip,this,too
...

These lines repeat after every 4 data entries. I tried the approaches from this post, Pandas: ignore all lines following a specific string when reading a file into a DataFrame, but nothing is being skipped and the dataframe is being turned into a MultiIndex.

I also tried looping with startswith() and appending to a list; however, the data ends up in a single column.

I'm trying to obtain this output:

1,2,3
4,5,6
7,8,9
10,11,12

There are multiple files, each containing over 7M rows. I'm looking for a fast, memory-efficient way to accomplish this.

Something else I tried was creating a list to skip rows 0 and 1, then again 6 and 7. Is it possible to achieve this that way?

Leb
  • Why not just pre-parse the file and remove any line that contains "skip" in it? – James Mertz Oct 14 '15 at 20:19
  • What is the criteria to skip? – Padraic Cunningham Oct 14 '15 at 20:22
  • @PadraicCunningham the data is being read from a sensor; the file that is written contains the sensor ID, such as `(1,1), ...`, and the second row is the column names. I don't care about those for the time being, I just need the data saved to a dataframe. – Leb Oct 14 '15 at 20:24
  • So you want every four entries skipping two in between? – Padraic Cunningham Oct 14 '15 at 20:27
  • Yep, I tried to create a formula for that but wasn't successful either. I included that explanation in my edit. – Leb Oct 14 '15 at 20:28
  • So you don't mind storing all the data in a list first? – Padraic Cunningham Oct 14 '15 at 20:34
  • @PadraicCunningham that won't be an issue. – Leb Oct 14 '15 at 20:35
  • Assuming you're using the `read_csv` function, you could use the `comment` param. However, it only allows a single character, and not a list of characters. :/ – James Mertz Oct 14 '15 at 20:37
  • Are you just skipping the two rows before every 4, like [0, 1, 6, 7, ...]? You could just generate the list of rows to skip and pass this to `skiprows` as an arg to `read_csv` – EdChum Oct 14 '15 at 20:40
  • @EdChum yes, those rows are unnecessary – Leb Oct 14 '15 at 20:41
  • So if you know the number of lines in the file then you could do something like `a = list(range(len_of_file)); rows = sorted(a[::6] + a[1::6])`, which creates the list of pairs of rows to skip, and then `pd.read_csv(file, skiprows=rows)` – EdChum Oct 14 '15 at 20:43

5 Answers


My suggestion would be to just pre-scrub the file beforehand:

with open("file.csv") as rp, open("outfile.csv", 'w') as wp:
    for line in rp:
        if 'skip' not in line:
            wp.write(line)
James Mertz
  • I would rather not double my data; I edited my question. That would not be efficient. I didn't downvote, though. – Leb Oct 14 '15 at 20:22
  • @Leb I think you're going to have to. This is actually a memory-efficient solution. – Andy Hayden Oct 14 '15 at 20:24
  • @Leb the way I see it is, either you're going to have to create a separate file, or generate a separate "in-memory" file. I don't know of any way to have `pandas` skip lines based on a specific criteria. – James Mertz Oct 14 '15 at 20:24
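A minimal sketch of the in-memory variant mentioned in the comment above (the buffer name `buf` and the filenames are illustrative, not from the answer):

import io
import pandas as pd

# filter out the junk lines while reading, keeping everything in memory
with open("file.csv") as rp:
    buf = io.StringIO("".join(line for line in rp if 'skip' not in line))

df = pd.read_csv(buf, header=None)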

Presuming you want to take sections of four lines coming after the two lines to skip, just skip two lines and take a slice of four rows from a csv reader object:

from itertools import islice, chain
import pandas as pd
import csv


def parts(r):
    # skip the two junk lines before the first block
    _, n = next(r, ""), next(r, "")
    while n:
        # hand out the next four data rows as a lazy slice
        yield islice(r, 4)
        # skip the next pair of junk lines, defaulting to "" at EOF
        _, n = next(r, ""), next(r, "")


with open("test.txt")as f:
        r = csv.reader(f)
        print(pd.DataFrame(list(chain.from_iterable(parts(r)))))

Output:

    0   1   2
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12

Or pass the chain object to pd.DataFrame.from_records:

with open("test.txt")as f:
    r = csv.reader(f)
    print(pd.DataFrame.from_records(chain.from_iterable(parts(r))))

    0   1   2
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12

Or a more general approach, using the consume recipe from the itertools docs to skip lines:

from itertools import islice, chain
from collections import deque
import pandas as pd
import csv

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is none, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)


def parts(r, sec_len, skip):
    # drop the leading junk lines, then alternate: yield sec_len rows, skip `skip` rows
    consume(r, skip)
    for sli in iter(lambda: list(islice(r, sec_len)), []):
        yield sli
        consume(r, skip)


with open("test.txt")as f:
    r = csv.reader(f)
    print(pd.DataFrame.from_records((chain.from_iterable(parts(r, 4, 2)))))

The last option is to write to a StringIO object and pass that:

from io import StringIO

def parts(r, sec_len, skip):
    # same pattern as above, but r is now the raw file object, so each
    # slice of lines is joined back into a single string
    consume(r, skip)
    for sli in iter(lambda: list(islice(r, sec_len)), []):
        yield "".join(sli)
        consume(r, skip)


with open("test.txt")as f:
    so = StringIO()
    so.writelines(parts(f, 4, 2))
    so.seek(0)
    print(pd.read_csv(so, header=None))
Padraic Cunningham
  • Thank you for adding a general approach as well. – Leb Oct 14 '15 at 20:52
  • No prob, consume is only going to be useful if you are skipping multiple lines but thought it was worth throwing in – Padraic Cunningham Oct 14 '15 at 20:55
  • Any idea why it's causing a `MemoryError` when doing it on the actual file? It's 265M and about 7.8M rows. I've opened larger files (size-wise), but do you think the error is due to the number of rows and doing `from_records`? That's where the error is occurring. – Leb Oct 14 '15 at 21:11
  • If you create a DataFrame calling list on the chain `pd.DataFrame(list(chain.from_iterable(parts(r))))` do you also get a memory error? – Padraic Cunningham Oct 14 '15 at 21:12
  • Yes, that one as well. – Leb Oct 14 '15 at 21:17
  • How about specifying a chunk size, something like `iterator=True, chunksize=1000`, and using concat (see the sketch below)? We are not storing a great deal in memory at all, so not sure why it is causing a `MemoryError`. Do you see your memory usage spiking? – Padraic Cunningham Oct 14 '15 at 21:39
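A minimal sketch of that chunked idea, assuming the filtered StringIO buffer `so` built in the last code block of the answer (the chunk size is illustrative):

# rewind the buffer, then read it in fixed-size chunks and stitch them together
so.seek(0)
reader = pd.read_csv(so, header=None, iterator=True, chunksize=1000)
df = pd.concat(reader, ignore_index=True)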

One method is to just generate the list of row numbers to skip. First determine the number of rows in the file using the method here: Count how many lines are in a CSV Python?
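For example, one of the simple variants from that post, which counts lines without loading the file into memory (the filename is illustrative):

# count the rows by streaming over the file once
with open("file.csv") as f:
    num_rows = sum(1 for _ in f)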

then do the following:

In [16]:
import io
import pandas as pd
t="""(1 1),skip,this
skip,this,too
1,2,3
4,5,6
7,8,9
10,11,12
(1 2),skip,this
skip,this,too"""
# generate the initial list; using 10 here, but you can get the number of rows using the method above
a = list(range(10))
# generate the pairs of rows to skip, in steps of 6
rows = sorted(a[::6] + a[1::6])
# now read it in
pd.read_csv(io.StringIO(t), skiprows=rows, header=None)

Out[16]:
    0   1   2
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
EdChum

Since you have a repeating pattern (toss two lines, keep four), I might do something like this:

from io import BytesIO
import pandas as pd

with open("skipper.csv", "rb") as fp:
    # keep rows 2-5 of every group of six (drop the first two)
    lines = (line for i, line in enumerate(fp) if i % 6 >= 2)
    df = pd.read_csv(BytesIO(b''.join(lines)), header=None)

which gives me

>>> df
    0   1   2
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
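
As an aside, and not part of the original answer: newer pandas versions also accept a callable for skiprows, which expresses the same pattern without joining the lines yourself:

import pandas as pd

# skip the first two rows of every group of six
df = pd.read_csv("skipper.csv", header=None, skiprows=lambda i: i % 6 < 2)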
DSM

I would pipe out to awk to do the filtering on the fly:

import subprocess
import pandas as pd

cmd = "awk '(NR - 1) % 6 > 1' test.csv"
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
df = pd.read_csv(proc.stdout, header=None)

The awk command will skip the first two lines out of every group of 6. Since the filtering is done on a streaming basis, this is very memory efficient, and since the filtering is done in a separate process, it'll be fast as well.
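
One small usage note, an addition not from the original answer: once pandas has consumed the pipe, it is good hygiene to reap the child process:

proc.stdout.close()  # we are done reading from the pipe
proc.wait()          # wait for awk to exit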

Evan Wright