- I have a file of 300 million lines (inputFile), each with 2 columns separated by a tab.
- I also have a list of 1000 unique items (vals).
I want to create a dictionary with column 1 as key and column 2 as value for all lines in inputFile where the first column occurs in vals. A few items in vals do not occur in the file; these items have to be saved in a new list. I can use up to 20 threads to speed up this process.
What is the fastest way to achieve this?
My best attempt so far:
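import subprocess

newDict = {}
foundVals = set()
# Build one anchored alternation: ^val1[[:space:]]|^val2[[:space:]]|...
# (every value is anchored to the line start, so none can match mid-line)
pattern = "|".join("^" + str(val) + "[[:space:]]" for val in vals)
# Pass the arguments as a list so no shell quoting is needed
cmd = ["grep", "-E", pattern, self.inputFile]
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
for line in p.stdout:
    info = line.split()
    foundVals.add(info[0])
    newDict[info[0]] = info[1]
p.wait()
# Keys read from grep are strings, so compare against str(x)
notFound = [x for x in vals if str(x) not in foundVals]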
Example inputFile:
2 9913
3 9913
4 9646
...
594592886 32630
594592888 32630
594592890 32630
vals:
[1,2,594592888]
Wanted dictionary:
{2:9913,594592888:32630}
And in notFound:
[1]
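To pin down the expected behaviour, here is a minimal single-threaded sketch of the same filtering done in pure Python with a set lookup. It is only a reference baseline, not a claim about the fastest approach. The names `filter_pairs` and `sample` are hypothetical, and it assumes that vals holds integers (as in the example) and that every line has exactly two tab-separated integer columns.

```python
import io

def filter_pairs(lines, vals):
    """Single pass over the lines: keep rows whose first column is in vals."""
    # Assumption: vals are ints. Compare as strings so that int() is only
    # called on the few matching lines, not on every line of the file.
    wanted = {str(v) for v in vals}
    found = {}
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key in wanted:
            found[int(key)] = int(value)
    # Preserve the order of vals for the items that never appeared
    not_found = [v for v in vals if v not in found]
    return found, not_found

# Small in-memory stand-in for inputFile, using rows from the example above
sample = io.StringIO("2\t9913\n3\t9913\n4\t9646\n594592888\t32630\n")
new_dict, not_found = filter_pairs(sample, [1, 2, 594592888])
print(new_dict)
print(not_found)
```

For the real file, the same function could be given an open file object, for example `with open(inputFile) as fh: filter_pairs(fh, vals)`.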