Is there any way to get a ConceptNet dump file that contains only English nodes?

Asked Jan 01 '23 at 11:11

Active Jan 01 '23 at 11:11

Viewed 111 times

I downloaded the full assertions dump from https://github.com/commonsense/conceptnet5/wiki/Downloads. The file is extremely large (~10GB), so I wrote a Python script to filter out the non-English nodes:

import pandas as pd

FILE = 'conceptnet-assertions-5.7.0.csv'
COLUMNS = ['uri', 'relation', 'start', 'end', 'json']
# the dump has no header row, so pass the column names explicitly
data = pd.read_csv(FILE, delimiter='\t', header=None, names=COLUMNS)
# keep only rows whose start and end nodes are both English concepts (/c/en/...)
data = data[data['start'].str.startswith('/c/en/') & data['end'].str.startswith('/c/en/')]
data = data.reset_index(drop=True)
print(data)

However, reading such a large CSV file with pandas is very slow, and even after a long wait I still have no result. Is there a more efficient way to filter out the non-English nodes?

Yuki Wang

You can use `csv.reader()` as an iterator over rows and load them lazily – forecastman Jan 01 '23 at 11:20
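
The lazy approach suggested in the comment could look roughly like the sketch below. It streams the file one row at a time instead of loading it into memory, and writes the matching rows to a new file. It assumes the dump is tab-separated with the columns in the order uri, relation, start, end, json, and that English concept URIs begin with `/c/en/`. The function name `filter_english` and the constant `EN_PREFIX` are made up for illustration.

```python
import csv

EN_PREFIX = '/c/en/'  # assumption: English concept URIs begin with this prefix


def filter_english(src_path, dst_path):
    """Copy rows whose start and end nodes are both English; return the count kept."""
    kept = 0
    with open(src_path, newline='', encoding='utf-8') as src, \
            open(dst_path, 'w', newline='', encoding='utf-8') as dst:
        # QUOTE_NONE keeps the quotes inside the JSON column as literal characters
        reader = csv.reader(src, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:  # only one row is held in memory at a time
            # assumed column order: uri, relation, start, end, json
            if (len(row) >= 4
                    and row[2].startswith(EN_PREFIX)
                    and row[3].startswith(EN_PREFIX)):
                dst.write('\t'.join(row) + '\n')
                kept += 1
    return kept
```

Calling `filter_english('conceptnet-assertions-5.7.0.csv', 'assertions-en.csv')` (the output name is arbitrary) should leave a much smaller file that pandas can then read normally. Checking the `/c/en/` prefix rather than searching for `'en'` anywhere in the URI avoids matching nodes such as `/c/fr/entendre`.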
