5

I have an sframe as such:

+---------+------+-------------------------------+-----------+------------------+
| term_id | lang |            term_str           | term_type | reliability_code |
+---------+------+-------------------------------+-----------+------------------+
| IATE-14 |  ro  |    Agenție de aprovizionare   |  fullForm |        3         |
| IATE-84 |  bg  | компетенции на държави чле... |  fullForm |        3         |
| IATE-84 |  cs  | příslušnost členských stát... |  fullForm |        3         |
| IATE-84 |  da  |     medlemsstatskompetence    |  fullForm |        3         |
| IATE-84 |  de  | Zuständigkeit der Mitglied... |  fullForm |        3         |
| IATE-84 |  el  | αρμοδιότητα των κρατών μελ... |  fullForm |        3         |
| IATE-84 |  en  | competence of the Member S... |  fullForm |        3         |
| IATE-84 |  es  | competencias de los Estado... |  fullForm |        3         |
| IATE-84 |  et  |     liikmesriikide pädevus    |  fullForm |        3         |
| IATE-84 |  fi  |   jäsenvaltioiden toimivalta  |  fullForm |        3         |
| IATE-84 |  fr  |  compétence des États membres |  fullForm |        3         |
| IATE-84 |  ga  |    inniúlacht na mBallstát    |  fullForm |        3         |
| IATE-84 |  hu  |       tagállami hatáskör      |  fullForm |        3         |
| IATE-84 |  it  | competenza degli Stati membri |  fullForm |        3         |
| IATE-84 |  lt  |  valstybių narių kompetencija |  fullForm |        2         |
| IATE-84 |  lv  |     dalībvalstu kompetence    |  fullForm |        3         |
| IATE-84 |  nl  |  bevoegdheid van de lidstaten |  fullForm |        3         |
| IATE-84 |  pl  | kompetencje państw członko... |  fullForm |        3         |
| IATE-84 |  pt  | competência dos Estados-Me... |  fullForm |        3         |
| IATE-84 |  ro  | competența statelor membre... |  fullForm |        3         |
+---------+------+-------------------------------+-----------+------------------+

I need to extract all the rows where lang == 'de' or lang == 'en' but the rows I extract with lang == 'en' needs to have a corresponding lang == 'de' such that they share the same term_id.

I have been doing it as such with graphlab and sframe:

sf = gl.SFrame.read_csv('iate.csv', delimiter='\t', quote_char='\0', column_type_hints=[str,str,unicode,str,int])
de = sf[sf['lang'] == 'de']
de_termids = de['term_id']

and de.print_rows(10):

+------------+------+-------------------------------+-----------+------------------+
|  term_id   | lang |            term_str           | term_type | reliability_code |
+------------+------+-------------------------------+-----------+------------------+
|  IATE-84   |  de  | Zuständigkeit der Mitglied... |  fullForm |        3         |
|  IATE-290  |  de  | Schutz der öffentlichen Ge... |  fullForm |        3         |
|  IATE-662  |  de  | mengenmäßigen Ausfuhrbesch... |  fullForm |        3         |
|  IATE-801  |  de  |     Eintragungshindernisse    |  fullForm |        2         |
| IATE-1326  |  de  | Sonderregelung für Reisebü... |  fullForm |        4         |
| IATE-1702  |  de  |          Erwerbslose          |  fullForm |        4         |
| IATE-2818  |  de  |    Verwaltungsvorschriften    |  fullForm |        3         |
| IATE-21139 |  de  |    frisches Obst und Gemüse   |  fullForm |        3         |
| IATE-21563 |  de  | chemische Erzeugnisse zur ... |  fullForm |        3         |
| IATE-21564 |  de  |         Mineralsäuren         |  fullForm |        3         |
+------------+------+-------------------------------+-----------+------------------+

And then:

en = sf[sf['lang'] == 'en']
en.print_rows(10)

[out]:

+------------+------+-------------------------------+--------------+------------------+
|  term_id   | lang |            term_str           |  term_type   | reliability_code |
+------------+------+-------------------------------+--------------+------------------+
|  IATE-84   |  en  | competence of the Member S... |   fullForm   |        3         |
|  IATE-254  |  en  | award of public works cont... |   fullForm   |        3         |
|  IATE-290  |  en  |    public health protection   |   fullForm   |        3         |
|  IATE-662  |  en  | quantitative restriction o... |   fullForm   |        3         |
|  IATE-801  |  en  |      grounds for refusal      |   fullForm   |        2         |
| IATE-1299  |  en  |              CEP              | abbreviation |        3         |
| IATE-1326  |  en  | special scheme for travel ... |   fullForm   |        3         |
| IATE-2818  |  en  |          regulations          |   fullForm   |        3         |
| IATE-7128  |  en  |          company name         |   fullForm   |        2         |
| IATE-21139 |  en  |  fresh fruits and vegetables  |   fullForm   |        3         |
+------------+------+-------------------------------+--------------+------------------+

I have tried:

en_de = en[en['term_id'] in de_termids]

But I'm getting the syntax wrong, giving me this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-12-9656091794b8> in <module>()
      1 en = sf[sf['lang'] == 'en']
----> 2 en_de = en[en['term_id'] in de_termids]

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __contains__(self, item)
    691 
    692         """
--> 693         return (self == item).any()
    694 
    695     def contains(self, item):

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __eq__(self, other)
    973                 return SArray(_proxy = self.__proxy__.vector_operator(other.__proxy__, '=='))
    974             else:
--> 975                 return SArray(_proxy = self.__proxy__.left_scalar_operator(other, '=='))
    976 
    977 

/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
     47             if not self.show_cython_trace:
     48                 # To hide cython trace, we re-raise from here
---> 49                 raise exc_type(exc_value)
     50             else:
     51                 # To show the full trace, we do nothing and let exception propagate

RuntimeError: Runtime Exception. Array size mismatch

How should I filter the sframe such that I get rows with en and de and corresponding term_id?

The resulting dataframe should look something like this:

+---------+-----------------+-------------+
| term_id |     term_str_en | term_str_de | 
+---------+-------------------------------+
| IATE-999 |    something    |  etwas      |
...
+---------+-----------------+-------------+

How do I do the same with pandas?

alvas
  • 115,346
  • 109
  • 446
  • 738

4 Answers4

2

Suppose you already have two data frames with filtered data for both languages: df_en and df_de. Then you can merge them:

new_df = pd.merge(df_en[['term_id','term_str']], df_de[['term_id','term_str']], how = 'inner', on ='term_id', suffixes = ('_en', '_de'))

The method inner takes care of skipping all unmatched rows. You can find more options to merge in pandas docs and refs

Edit

The same result without creating two data frames (df is the original data frame with all entries, possibly containing other languages too):

new_df = pd.merge(df.loc[df['lang']=='en',['term_id','term_str']], df.loc[df['lang']=='de',['term_id','term_str']], how = 'inner', on ='term_id', suffixes = ('_en', '_de'))
ptrj
  • 5,152
  • 18
  • 31
  • Thanks! Is there a way to do it without creating `df_en` and `df_de` first? – alvas Apr 20 '16 at 03:04
  • 1
    @alvas Yes. Instead of `df_en` you can use `df[df['lang']=='en']` and similarly for `de`. I'll edit the answer in a second. Basically, it's the same procedure but without explicitly creating new data frames. – ptrj Apr 20 '16 at 03:14
  • Glad I could help. – ptrj Apr 20 '16 at 03:23
1

Since you've only gotten answers for pandas, here's how to do it in SFrame, given de_termids and en in your code samples:

en.filter_by(de_termids, 'term_id')
Evan Samanas
  • 516
  • 3
  • 6
0

do you mean that, for each term_id, the lang should be 'de', or ('de', 'en') but not 'en' only? if so,

I think you can select lang = 'de' or 'en', and filter out term_id which has only 'en' in the lang

PhilChang
  • 2,591
  • 1
  • 16
  • 18
  • This works: `sf_deen = sf[sf['lang'] == ('de' or 'en')]` to get the rows with de and en but they're not aligned by the `term_id` to achieve the desired data frame as shown in the question =( – alvas Apr 20 '16 at 01:32
0

I am on a phone at the moment, but I think you want to take advantage of the index alignment and do joins.

eg.

de_terms = sf [sf ['lang'] == 'de'].set_index ('term_id')

This will make term_id the index for the dataframe. Do the same for en, and then assign the term column from one to the other under a new name.

Normally, the destination determines what indexes are included, so e.g. if you only want to include entries with 'en' if they also have 'de', but all 'de', merge the en column into the de frame; indexes only in the en frame will get dropped. or, you can do filtering to get boolean combinations of no nan.

Corley Brigman
  • 11,633
  • 5
  • 33
  • 40