I downloaded the full assertions dump from https://github.com/commonsense/conceptnet5/wiki/Downloads. The file is extremely large (~10GB), so I wrote a Python script to filter out the non-English nodes:

import pandas as pd

FILE = 'conceptnet-assertions-5.7.0.csv'
COLUMNS = ['uri', 'relation', 'start', 'end', 'json']
# The dump has no header row, so pass header=None and name the columns explicitly
data = pd.read_csv(FILE, delimiter='\t', header=None, names=COLUMNS)
# Delete non-English nodes: keep rows whose start and end URIs both begin with '/c/en/'
data = data[data['start'].str.startswith('/c/en/') & data['end'].str.startswith('/c/en/')]
data = data.reset_index(drop=True)
print(data)

However, reading this large CSV file with pandas is very time-consuming, and even after a long wait I still do not get a result. Is there an efficient way to filter out the non-English nodes?
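For reference, one direction that might avoid loading the whole file into memory is to stream it line by line and keep only the matching rows. The sketch below is a minimal illustration, not a tested solution for the real dump: `filter_english` is a hypothetical helper name, and it assumes the columns are tab-separated in the order uri, relation, start, end, json, and that English node URIs begin with `/c/en/`. The sample rows are made up for the demonstration.

```python
import io


def filter_english(lines, out):
    """Write to `out` only the rows whose start and end nodes are English.

    `lines` is any iterable of tab-separated text lines (for example an open
    file), so only one line is held in memory at a time. Returns the number
    of rows kept.
    """
    kept = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed rows with too few columns
        start, end = fields[2], fields[3]
        # Assumption: English concept URIs start with the prefix '/c/en/'
        if start.startswith("/c/en/") and end.startswith("/c/en/"):
            out.write(line if line.endswith("\n") else line + "\n")
            kept += 1
    return kept


# Small in-memory demonstration with made-up rows
sample = (
    "/a/1\t/r/IsA\t/c/en/cat\t/c/en/animal\t{}\n"
    "/a/2\t/r/IsA\t/c/fr/chat\t/c/en/animal\t{}\n"
    "/a/3\t/r/Synonym\t/c/en/dog\t/c/de/hund\t{}\n"
)
out = io.StringIO()
kept = filter_english(io.StringIO(sample), out)
print(kept)
print(out.getvalue(), end="")
```

On the real file, the same function could be called with `open(FILE, encoding='utf-8')` as `lines` and a second file opened for writing as `out`; the much smaller filtered file could then be loaded with pandas if a DataFrame is needed.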

Yuki Wang
