I have an sframe
as such:
+---------+------+-------------------------------+-----------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+---------+------+-------------------------------+-----------+------------------+
| IATE-14 | ro | Agenție de aprovizionare | fullForm | 3 |
| IATE-84 | bg | компетенции на държави чле... | fullForm | 3 |
| IATE-84 | cs | příslušnost členských stát... | fullForm | 3 |
| IATE-84 | da | medlemsstatskompetence | fullForm | 3 |
| IATE-84 | de | Zuständigkeit der Mitglied... | fullForm | 3 |
| IATE-84 | el | αρμοδιότητα των κρατών μελ... | fullForm | 3 |
| IATE-84 | en | competence of the Member S... | fullForm | 3 |
| IATE-84 | es | competencias de los Estado... | fullForm | 3 |
| IATE-84 | et | liikmesriikide pädevus | fullForm | 3 |
| IATE-84 | fi | jäsenvaltioiden toimivalta | fullForm | 3 |
| IATE-84 | fr | compétence des États membres | fullForm | 3 |
| IATE-84 | ga | inniúlacht na mBallstát | fullForm | 3 |
| IATE-84 | hu | tagállami hatáskör | fullForm | 3 |
| IATE-84 | it | competenza degli Stati membri | fullForm | 3 |
| IATE-84 | lt | valstybių narių kompetencija | fullForm | 2 |
| IATE-84 | lv | dalībvalstu kompetence | fullForm | 3 |
| IATE-84 | nl | bevoegdheid van de lidstaten | fullForm | 3 |
| IATE-84 | pl | kompetencje państw członko... | fullForm | 3 |
| IATE-84 | pt | competência dos Estados-Me... | fullForm | 3 |
| IATE-84 | ro | competența statelor membre... | fullForm | 3 |
+---------+------+-------------------------------+-----------+------------------+
I need to extract all the rows where lang == 'de' or lang == 'en'
but the rows I extract with lang == 'en'
needs to have a corresponding lang == 'de'
such that they share the same term_id
.
I have been doing it as such with graphlab
and sframe
:
sf = gl.SFrame.read_csv('iate.csv', delimiter='\t', quote_char='\0', column_type_hints=[str,str,unicode,str,int])
de = sf[sf['lang'] == 'de']
de_termids = de['term_id']
and de.print_rows(10)
:
+------------+------+-------------------------------+-----------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+------------+------+-------------------------------+-----------+------------------+
| IATE-84 | de | Zuständigkeit der Mitglied... | fullForm | 3 |
| IATE-290 | de | Schutz der öffentlichen Ge... | fullForm | 3 |
| IATE-662 | de | mengenmäßigen Ausfuhrbesch... | fullForm | 3 |
| IATE-801 | de | Eintragungshindernisse | fullForm | 2 |
| IATE-1326 | de | Sonderregelung für Reisebü... | fullForm | 4 |
| IATE-1702 | de | Erwerbslose | fullForm | 4 |
| IATE-2818 | de | Verwaltungsvorschriften | fullForm | 3 |
| IATE-21139 | de | frisches Obst und Gemüse | fullForm | 3 |
| IATE-21563 | de | chemische Erzeugnisse zur ... | fullForm | 3 |
| IATE-21564 | de | Mineralsäuren | fullForm | 3 |
+------------+------+-------------------------------+-----------+------------------+
And then:
en = sf[sf['lang'] == 'en']
en.print_rows(10)
[out]:
+------------+------+-------------------------------+--------------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+------------+------+-------------------------------+--------------+------------------+
| IATE-84 | en | competence of the Member S... | fullForm | 3 |
| IATE-254 | en | award of public works cont... | fullForm | 3 |
| IATE-290 | en | public health protection | fullForm | 3 |
| IATE-662 | en | quantitative restriction o... | fullForm | 3 |
| IATE-801 | en | grounds for refusal | fullForm | 2 |
| IATE-1299 | en | CEP | abbreviation | 3 |
| IATE-1326 | en | special scheme for travel ... | fullForm | 3 |
| IATE-2818 | en | regulations | fullForm | 3 |
| IATE-7128 | en | company name | fullForm | 2 |
| IATE-21139 | en | fresh fruits and vegetables | fullForm | 3 |
+------------+------+-------------------------------+--------------+------------------+
I have tried:
en_de = en[en['term_id'] in de_termids]
But I'm getting the syntax wrong, giving me this error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-12-9656091794b8> in <module>()
1 en = sf[sf['lang'] == 'en']
----> 2 en_de = en[en['term_id'] in de_termids]
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __contains__(self, item)
691
692 """
--> 693 return (self == item).any()
694
695 def contains(self, item):
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __eq__(self, other)
973 return SArray(_proxy = self.__proxy__.vector_operator(other.__proxy__, '=='))
974 else:
--> 975 return SArray(_proxy = self.__proxy__.left_scalar_operator(other, '=='))
976
977
/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate
RuntimeError: Runtime Exception. Array size mismatch
How should I filter the sframe such that I get rows with en
and de
and corresponding term_id
?
The resulting dataframe should look something like this:
+---------+-----------------+-------------+
| term_id | term_str_en | term_str_de |
+---------+-------------------------------+
| IATE-999 | something | etwas |
...
+---------+-----------------+-------------+
How do I do the same with pandas
?