
I want to read an XML dump page by page in Java. I have the following Scala code, but I don't understand it well enough to rewrite it. What would the equivalent Java code be? I know the Scala version runs fine on the JVM, but I want something I can understand.

import edu.umd.cloud9.collection.wikipedia.language._
import edu.umd.cloud9.collection.wikipedia._

def wikiXmlToPlainText(xml: String): Option[(String, String)] = {
  val page = new EnglishWikipediaPage()
  WikipediaPage.readPage(page, xml)
  if (page.isEmpty) None
  else Some((page.getTitle, page.getContent))
}

val plainText = rawXmls.flatMap(wikiXmlToPlainText)
user283686

1 Answer


I can't tell what the type of "rawXmls" is; I'm guessing some sort of collection of Strings. The following should be (more or less) a conversion of the wikiXmlToPlainText method, wrapped in a wikiXmlToPlainTextUtil class, that returns a Java Optional of a List of Strings instead of a tuple. I'll leave applying this to a stream as an exercise for you; this answer might be helpful for that.

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

import edu.umd.cloud9.collection.wikipedia.language.*;
import edu.umd.cloud9.collection.wikipedia.*;

class wikiXmlToPlainTextUtil {
  // Returns [title, content] for a non-empty page, or Optional.empty() otherwise.
  Optional<List<String>> wikiXmlToPlainText(String xml) {
    EnglishWikipediaPage page = new EnglishWikipediaPage();
    // Fills 'page' from the raw XML of a single page.
    WikipediaPage.readPage(page, xml);
    if (page.isEmpty()) {
      return Optional.empty();
    } else {
      List<String> result = new ArrayList<>();
      result.add(page.getTitle());
      result.add(page.getContent());
      return Optional.of(result);
    }
  }
}
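As a rough sketch of the stream step left as an exercise above: the Scala `flatMap` over an `Option` keeps only the non-empty results, which in Java 8 can be approximated with `map`, `filter(Optional::isPresent)`, and `map(Optional::get)`. The class `PlainTextStreamSketch` and its `parse` method are hypothetical stand-ins so the example runs without Cloud9; in real code you would pass the `wikiXmlToPlainText` method instead, and `rawXmls` is assumed here to be a `List<String>`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

class PlainTextStreamSketch {

    // Hypothetical stand-in for wikiXmlToPlainText: a blank string plays
    // the role of an empty page, anything else yields [title, content].
    static Optional<List<String>> parse(String xml) {
        if (xml == null || xml.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(Arrays.asList("title of " + xml, "content of " + xml));
    }

    // Java counterpart of rawXmls.flatMap(wikiXmlToPlainText):
    // convert every element, then drop the empty Optionals.
    static List<List<String>> toPlainText(
            List<String> rawXmls,
            Function<String, Optional<List<String>>> converter) {
        return rawXmls.stream()
                .map(converter)
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(Collectors.toList());
    }
}
```

Calling `PlainTextStreamSketch.toPlainText(rawXmls, PlainTextStreamSketch::parse)` returns one `[title, content]` list per non-empty input, in the original order.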
Angelo Genovese
  • I'd actually suggest creating a Java bean with title and body properties and populating that rather than using the `List`; again, left as an exercise for the reader. – Angelo Genovese Jun 08 '16 at 18:40
  • Actually, `List(1, 2, 3).flatMap(a => Some(a).filter(_ % 2 == 0))` compiles and returns `List(2)`. There is no indication at all in the original code of the type of either `rawXmls` or `plainText`, so I stand by what I said. The parameter of `wikiXmlToPlainText` tells us it is a `Something[String]`, and the return type tells us it is not strictly a monad but likely one of the standard collections. Spark's RDD also accepts any `A => Traversable[B]` as the function passed to flatMap, for example. – Angelo Genovese Jun 08 '16 at 20:10
  • Thank you, and yes, I need to use it as a JavaRDD, or as a JavaRDD<Tuple2<String1, String2>> with String1 as the page title and String2 as the page body. – user283686 Jun 08 '16 at 23:35
  • In that case you can replace the List with a simple Tuple2 constructor, and the Optional with Scala's Option class. – Angelo Genovese Jun 09 '16 at 01:18
  • @Angelo Genovese : Sorry, you're right! Type is not as certain as I stated. – Det Jun 09 '16 at 16:22
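For reference, the Java bean with title and body properties suggested in the first comment could be sketched roughly as follows. The names `WikiPageText` and `fromPage` are hypothetical, and the three arguments of `fromPage` are assumed to come from `page.isEmpty()`, `page.getTitle()` and `page.getContent()` in the real code. The class implements `Serializable` on the assumption that it will be stored in a Spark RDD.

```java
import java.io.Serializable;
import java.util.Optional;

// Hypothetical bean holding one page's plain text.
class WikiPageText implements Serializable {
    private static final long serialVersionUID = 1L;

    private String title;
    private String body;

    // Beans conventionally expose a no-argument constructor.
    public WikiPageText() {}

    public WikiPageText(String title, String body) {
        this.title = title;
        this.body = body;
    }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getBody() { return body; }
    public void setBody(String body) { this.body = body; }

    // Mirrors the Scala "if (page.isEmpty) None else Some(...)" branch.
    static Optional<WikiPageText> fromPage(boolean pageIsEmpty, String title, String content) {
        if (pageIsEmpty) {
            return Optional.empty();
        }
        return Optional.of(new WikiPageText(title, content));
    }
}
```

Named getters make call sites easier to read than `result.get(0)` and `result.get(1)` on a `List<String>`.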