
I have a 5GB file of businesses and I'm trying to extract all the businesses whose business type codes (SNACODE) start with the SNACODEs corresponding to grocery stores. For example, the SNACODEs for some businesses could be 42443013, 44511003, 44419041, 44512001, 44522004, and I want all businesses whose codes start with one of my grocery SNACODEs, codes = [4451,4452,447,772,45299,45291,45212]. In this case, I'd get the rows for 44511003, 44512001, and 44522004.

Based on what I googled, the most efficient way to read in the file seemed to be one row at a time (if not the SQL route). I then used a for loop and checked whether my SNACODE column started with any of my codes (which was probably a bad idea, but the only way I could get working).

I have no idea how many rows are in the file, but there are 84 columns. My computer was running for so long that I asked a friend who said it should only take 10-20 min to complete this task. My friend edited the code but I think he misunderstood what I was trying to do because his result returns nothing.

I am now trying to find a more efficient method than redoing my 9.5 hours and having my laptop run for an unknown amount of time. The closest thing I've been able to find is "Most efficient way to find partial string matches in large file of strings (python)", but it doesn't seem like what I was looking for.

Questions:

What's the best way to do this? How long should this take?
Is there any way that I can start where I stopped? (I have no idea how many rows of my 5GB file I read, but I have the last saved line of data. Is there a fast/easy way to find the line corresponding to a unique ID in the file without having to read each line?)

This is what I tried; in 9.5 hours it produced a 72MB file (200k+ rows) of grocery stores:

    import numpy as np
    import pandas as pd

    codes = [4451, 4452, 447, 772, 45299, 45291, 45212]  # codes for grocery stores
    # headers = the list of the 84 column names, defined earlier in my script
    for df in pd.read_csv('infogroup_bus_2010.csv', sep=',', chunksize=1):  # one row at a time
        data = np.asarray(df)
        data = pd.DataFrame(data, columns=headers)
        for code in codes:
            if np.char.startswith(str(data["SNACODE"][0]), str(code)):
                with open("grocery.csv", "a") as myfile:
                    data.to_csv(myfile, header=False)
                    print code
                break  # stop checking codes once one matches

This is what my friend edited it to. I'm pretty sure the x = df[df.SNACODE.isin(codes)] is only finding exact matches, and is thus returning nothing.

    import pandas as pd

    codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
    matched = []
    for df in pd.read_csv('infogroup_bus_2010.csv', sep=',', chunksize=1024*1024,
                          dtype=str, low_memory=False):
        x = df[df.SNACODE.isin(codes)]
        if len(x):
            matched.append(x)
        print "Processed chunk and found {} matches".format(len(x))

    output = pd.concat(matched, axis=0)
    output.to_csv("grocery.csv", index=False)
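
A quick way to see why this returns nothing: isin tests exact membership, and with dtype=str the column holds strings while codes holds integers, so no value can ever match; even with string codes it would still be an exact match rather than a prefix match. A minimal sketch of the difference, using the example SNACODEs from above:

    import pandas as pd

    codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
    prefixes = tuple(str(c) for c in codes)
    s = pd.Series(['42443013', '44511003', '44419041', '44512001', '44522004'], name='SNACODE')

    print(s.isin(codes).sum())            # 0 -- string values never equal integer codes
    print(s.isin(prefixes).sum())         # still 0 -- isin is an exact match, not a prefix match
    print(s[s.str.startswith(prefixes)])  # 44511003, 44512001, 44522004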

Thanks!

user3768258
    You should create a test file where it will only run for a minute or two and then you can test algorithms to see which one is the most efficient. – kyle heitman Nov 02 '15 at 06:55
  • For simple string matching, you can simplify a lot by dropping Pandas from the requirements. – tripleee Nov 02 '15 at 06:57
  • If this is a one-off, I don't think it's worth exploring how to continue from somewhere in the middle. Just rerun from start to finish. – tripleee Nov 02 '15 at 07:00
  • http://stackoverflow.com/questions/17957890/pandas-select-from-dataframe-using-startswith – tripleee Nov 02 '15 at 07:06

2 Answers


To increase speed you could pre-build a single regexp matching the lines you need, then read the raw file lines (no CSV parsing) and check them with the regexp...

import re

codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
col_num = 4  # number of columns before SNACODE
expr = re.compile("[^,]*," * col_num +
                  "(" + "|".join(map(str, codes)) + ")" +
                  ".*")
for L in open('infogroup_bus_2010.csv'):
    if expr.match(L):
        print L

Note that this is just a simple sketch, as no escaping is considered... if the SNACODE column is not the first one and preceding fields may contain a comma, you need a more sophisticated regexp like:

...
'([^"][^,]*,|"([^"]|"")*",)' * col_num +
...

that ignores commas inside double-quotes
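
Put together, the quote-aware variant might look something like the sketch below (assembled from the pieces above; note it still doesn't handle completely empty unquoted fields, and writing the matches to grocery.csv is just an assumption about the desired output):

import re

codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
col_num = 4  # number of columns before SNACODE

# One leading field, either unquoted (no commas allowed) or double-quoted with
# "" as an escaped quote, including the trailing comma.
field = '([^"][^,]*,|"([^"]|"")*",)'
expr = re.compile(field * col_num +
                  "(" + "|".join(map(str, codes)) + ")")

with open('grocery.csv', 'w') as out:
    for line in open('infogroup_bus_2010.csv'):
        if expr.match(line):
            out.write(line)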

6502
  • Sorry, I'm confused about your last comment: "Note that this is just a simple sketch as no escaping is considered... if the SNACODE column is not the first one and preceding fields may contain a comma you need a more sophisticated regexp." Can you please elaborate? – user3768258 Nov 02 '15 at 15:46
  • @user3768258: the regexp that this code is building to skip over the fields preceding the SNACODE field is `[^,]*,[^,]*,...`, i.e. a sequence of zero or more non-commas followed by a comma. If the field content is, for example, `"hey, dude"`, the comma is quoted and shouldn't be considered a field separator. It's possible of course to write a regexp that will skip quoted commas, but it's more complex than just `[^,]*`. – 6502 Nov 02 '15 at 15:52
  • In general, is it faster to save L as a new data frame and then write it at the end as one big df, or to write every line as I get it? Thanks! – user3768258 Nov 02 '15 at 15:57
  • Python buffering is OK... this kind of processing (reading one line, writing one line) is normally reasonably fast and all the time-consuming stuff is done in C. I wouldn't expect a long computation time for a 5GB input (for sure not hours). To get the maximum speed, however, I'd go for a C++ version using `mmap` to access the file content... for this kind of processing I'd expect that to be basically the same speed as just copying the 5GB file. If many different searches are needed on the same data, building an index is probably a better approach... – 6502 Nov 02 '15 at 16:01

You can probably make your pandas solution much faster:

import pandas as pd

codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
codes = [str(code) for code in codes]

sna = pd.read_csv('infogroup_bus_2010.csv', usecols=['SNACODE'],
                  chunksize=int(1e6), dtype={'SNACODE': str})

with open('grocery.csv', 'w') as fout:
    for chunk in sna:
        for code in chunk['SNACODE']:
            for target_code in codes:
                if code.startswith(target_code):
                    fout.write('{}\n'.format(code))
                    break  # don't write the same value twice if it matches two prefixes

Read only the needed column with usecols=['SNACODE']. You can adjust the chunk size with chunksize=int(1e6). Depending on your RAM you can likely make it much bigger.
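
This writes only the matching SNACODE values. If you need the full rows (as the question asks), the same chunked approach can keep every column; a sketch along those lines, using str.startswith with a tuple of prefixes for the prefix test:

import pandas as pd

codes = [str(code) for code in [4451, 4452, 447, 772, 45299, 45291, 45212]]

chunks = pd.read_csv('infogroup_bus_2010.csv', chunksize=int(1e6),
                     dtype={'SNACODE': str})

first = True
for chunk in chunks:
    # Keep rows whose SNACODE starts with any of the grocery prefixes.
    mask = chunk['SNACODE'].str.startswith(tuple(codes), na=False)
    matches = chunk[mask]
    if len(matches):
        # Write the header only with the first batch, then append.
        matches.to_csv('grocery.csv', mode='w' if first else 'a',
                       header=first, index=False)
        first = False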

Mike Müller