
I'm trying to generate some summary data on a set. I don't care about the numbers themselves, only their exponents: the goal is to find the total number of 7-digit numbers (e.g. phone numbers). The way I'm handling this currently is pretty simplistic.

I have a data set in CSV, it looks something like this:

"1.108941100000000000e+07, 4.867837000000000000e+06, ... "

# numlist is the dataset

x = np.trunc(np.log10(numlist))    
total = (x == 6).sum()

And that gives me the number of 7-digit numbers. When I chose that approach I assumed the inputs would be a list of integers, but now I see the data could actually be given/stored in scientific notation. If it is given in scientific notation, is there a faster way to achieve the same result? Is there a way I can load only the exponents from the CSV file and skip the log10 step entirely?

Also, I'm not limited to using numpy arrays, but after some experimentation they were the fastest implementation for my purpose.
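For reference, here is the current approach end to end on a few hypothetical sample values (assuming `numlist` is a float array already loaded from the CSV; the values below are made up to match the format shown):

```python
import numpy as np

# Hypothetical sample values in the same scientific-notation form as the CSV.
numlist = np.array([1.1089411e+07, 4.867837e+06, 9.99e+06, 1.23e+05])

# trunc(log10(x)) == 6 exactly when 1e6 <= x < 1e7, i.e. x has 7 digits.
x = np.trunc(np.log10(numlist))
total = (x == 6).sum()
print(total)  # 2 of the sample values are 7-digit numbers
```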

smpat04
  • Do you have performance problems? – Daniel Jun 09 '18 at 20:47
    Not necessarily but I'd like to be able to code as efficiently as possible. This data set only has 1.5m lines but what if it had 150m lines? I would like to at least understand in principle how/if it can be done. – smpat04 Jun 09 '18 at 20:52
  • Maybe I am missing something, but does `np.log10(x)` care whether `x` is an integer or a float? It seems that `numlist` must already have been converted to a number from a string (I assume). If you know that all input data _always_ contain the exponent `e`, then probably you could use @James's answer to count `e+0` in each line, bypassing full-blown parsing/conversion to numbers. – AGN Gazer Jun 10 '18 at 03:07
  • Somewhat related: https://stackoverflow.com/q/18152597/8033585 – AGN Gazer Jun 10 '18 at 03:19
  • If the number of digits is important, I'd suggest counting the characters in the string format. The string format is the only thing that will work if the first digit might be `0`. – hpaulj Jun 10 '18 at 04:27
  • numpy has a `fromregex` function that may be of help, but you need to load the whole file. I think that `mmap` lets you read files by chunks without loading the whole file, but I don't know how to do it or whether it can fit your needs. – bobrobbob Jun 10 '18 at 08:56

1 Answer


You may want to write a custom parser to use when reading the file rather than reading in all of the data just to toss it away later.

Count of exponents of size n

def count_exponents(path, n):
    # Note: 'e+0' assumes a single-digit, non-negative exponent (0-9),
    # which matches data formatted like the sample in the question.
    n_str = 'e+0' + str(n)
    out = 0
    with open(path) as fp:
        for line in fp:
            out += line.count(n_str)
    return out
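Note that with data like the sample in the question, a 7-digit number (between 1e6 and 1e7) is printed with exponent `e+06`, so you would pass `n=6`. The substring count on one hypothetical sample line shows the idea:

```python
# A 7-digit number is written as d.ddd...e+06 in this format,
# so counting 'e+06' substrings counts 7-digit numbers.
line = '1.108941100000000000e+07, 4.867837000000000000e+06, 9.99e+06'
print(line.count('e+06'))  # 2
```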

Return exponents

import re
pattern = re.compile(r'e([+\-]\d+)')

def get_exponents(path):
    with open(path) as fp:
        out = [pattern.findall(line) for line in fp]
    return out
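The matches come back as strings like `'+07'`, which `int()` converts directly to exponents. A minimal self-contained sketch on one hypothetical sample line:

```python
import re

# Same pattern as above: captures the signed exponent after 'e'.
pattern = re.compile(r'e([+\-]\d+)')

line = '1.108941100000000000e+07, 4.867837000000000000e+06'
exps = [int(e) for e in pattern.findall(line)]
print(exps)  # [7, 6]
```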
James
  • Thanks for taking a look. – smpat04 Jun 10 '18 at 14:23
  • Thanks for taking a look. I am calling this in this way: `x = get_exponents('PhoneData.csv') print(x)` and it is returning a list of empty lists [[ ], [ ], [ ], .... , [ ]] I assume I am doing this incorrectly. Sorry for the amateurish questions- I am an amateur :) – smpat04 Jun 10 '18 at 14:28