2

I have a big text file (300,000,000 rows), but it is full of undesired data that I would like to remove: the rows containing the string "0000e".

I tried:

import pandas as pd

f=pd.read_csv('File.txt', skiprows=139, header=None, index_col=False)
f=f.iloc[:,0]
f1=f[f.str.contains("0000e")==False]

and

f=pd.read_csv('file.txt', skiprows=139, header=None, index_col=False, chunksize=50)
dfs = pd.concat([x[x[0].str.endswith('000e')==False] for x in f])

but it takes rather long. Is there a faster way to skip lines containing a certain string? Perhaps with na_values?

Jason Aller
JRLCAR
  • What are your intentions with these rows? You could also use the `memory_map` option to improve performance in the loading stage and possibly other stages. – Jab Mar 28 '19 at 18:33
  • To speed things up, have you tried just reading the .txt file into a file object (not via pandas in this case) and then saving each line to a list of strings, excluding the lines that contain '000e' as you add them? It could even be a tuple. – RockAndRoleCoder Mar 28 '19 at 18:34
  • Then afterwards, if your file needs more analysis, you can save it to a dataframe and proceed. – RockAndRoleCoder Mar 28 '19 at 18:34
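A rough sketch of the approach suggested in these comments, assuming the '0000e' marker and File.txt from the question (this code is not from the comments themselves):

import pandas as pd

wanted = []
with open('File.txt', 'r') as f:
    for line in f:
        # keep only the lines that do not contain the unwanted marker
        if '0000e' not in line:
            wanted.append(line.rstrip('\n'))

# if the data needs further analysis, load the filtered lines into a DataFrame
df = pd.DataFrame(wanted)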

2 Answers

3

I prefer your first attempt, as it is definitely more readable, on top of the fact that your second snippet has x's and I don't know what they refer to.

That said, using memory_map=True will boost performance, as noted in the docs. You can also gain a little extra by dropping the second line and accessing the column on the same line you create the df. Lastly, replacing the ...==False check with ~... may provide some benefit, as ~ is a logical NOT; however, you need to handle the NaN values or you get an error. Luckily, Series.str.contains accepts an na argument that sets the value used for NaN entries.

import pandas as pd

df = pd.read_csv('File.txt', memory_map=True, header=None, index_col=False).iloc[:, 0]
df1 = df[~df.str.contains("0000e", na=False)]
# if you also want to drop the NaN rows, use this instead
df1 = df[~df.str.contains("0000e", na=False)].dropna()

Alternatively, doing this with the csv module is much faster, even if you decide to load the result into pandas afterwards. I don't know what your data looks like, but I tested this with a csv file containing 3 columns and 100 rows and got roughly 9x better performance. That probably won't translate directly to your results, but this is definitely the method I would choose if I were you.

import pandas as pd
from csv import reader

filter = '0000e' # so we aren't making a new string every iteration
with open('File.txt', 'r') as f:
  df = pd.DataFrame(first for first, *_ in reader(f) if filter not in first)
  # if you also want to skip empty rows, use this condition instead:
  # df = pd.DataFrame(first for first, *_ in reader(f) if first and filter not in first)
  # note this also skips empty strings; use `if first is not None` to skip only None values
Jab
  • Thank you very much! I tried method 2 and it works faster. I now have a DataFrame of 1999048 rows and am facing a new speed issue with processing my data. I'll post a new question. Thank you again for your answer! – JRLCAR Mar 29 '19 at 10:19
1

If you have access to a Linux or macOS system, you can do this in a pre-processing step that is probably much faster, using grep -v, which returns all lines that do not match:

grep -v 0000e File.txt > small_file.txt

On Windows, the equivalent is findstr /v:

findstr /v "0000e" File.txt > small_file.txt

you can call the os command from inside your python code, see here
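For example, a minimal sketch using the standard-library subprocess module to run the grep command above from Python (file names taken from that command; this assumes grep is on the PATH):

import subprocess

# run grep -v and send the filtered output to small_file.txt
with open('small_file.txt', 'w') as out:
    subprocess.run(['grep', '-v', '0000e', 'File.txt'], stdout=out, check=True)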

and if you want to make it able to handle multiple OSes, see here
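Building on the snippet above, one possible sketch picks the command based on sys.platform (the Windows branch assumes findstr is available):

import subprocess
import sys

# choose the filtering command for the current platform
if sys.platform.startswith('win'):
    cmd = ['findstr', '/v', '0000e', 'File.txt']
else:
    cmd = ['grep', '-v', '0000e', 'File.txt']

with open('small_file.txt', 'w') as out:
    subprocess.run(cmd, stdout=out, check=True)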

philshem