
I have several documents of text that contain lines like the ones below. I want to extract the data that occurs after "pn=" and put it in a map; in the example below, group becomes the key and Fulton_County_Grand_Jury the value. I need help building a regex to extract this.

 <wf cmd=done rdf=group pos=NNP lemma=group wnsn=1 lexsn=1:03:00::
 pn=group>Fulton_County_Grand_Jury</wf>
serendipity
  • Why are you using a regex to read XML? There are native XML tools that you could use. – AxelH Feb 02 '17 at 11:41
  • Better use an XML parser, not a regex, to process XML, unless you have full control over the documents and know for certain that they do not have comments, nested structures, and other structural issues that make the XML language irregular. – RealSkeptic Feb 02 '17 at 11:43
  • Here is a good place to start to see what exists natively: [tutorialspoint.com](https://www.tutorialspoint.com/java_xml/) – AxelH Feb 02 '17 at 11:44
  • http://stackoverflow.com/a/1732454/2545439 ;) – Pieter De Bie Feb 02 '17 at 12:21
  • @PieterDeBie That was a great read! In my defense, I also saw this: "If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine." :) – serendipity Feb 02 '17 at 12:42

2 Answers


Use a regex, with this pattern: "pn=(.*?)>"

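    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Group 1 captures the text between "pn=" and the next ">"
    final String line = "<wf cmd=done rdf=group pos=NNP lemma=group wnsn=1 lexsn=1:03:00:: pn=group>Fulton_County_Grand_Jury</wf>";
    final Matcher m = Pattern.compile("pn=(.*?)>").matcher(line);
    while (m.find()) {
        System.out.println(m.group(1)); // prints "group"
    }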
ΦXocę 웃 Пepeúpa ツ
  • None of =, <, or > is a metacharacter on its own. See https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html – Olaf Dietsche Feb 02 '17 at 11:50
  • Yes, provided pn= is always exactly before the closing >. – Olaf Dietsche Feb 02 '17 at 11:54
  • @ΦXocę웃Пepeúpaツ Thanks for the answer! Both the regexes you shared work and m.group(1) gives me the key value i.e. group. Still need the value. Maybe use substring for that? – serendipity Feb 02 '17 at 11:57

The most reliable way would be to use an XML parser.

Apart from that, you have to look for pn=, the end of its value, and the part between > and <. Something like this:

    <wf.*? pn=([^ >]+).*?>(.*?)<
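A minimal sketch of how a pattern like this might be used to fill the map the question asks for. The class and method names (`PnExtractor`, `extract`) are hypothetical. It also assumes two small changes to the pattern: `\s` instead of a literal space, and `Pattern.DOTALL`, because in the sample the tag is split across two lines and `.` does not match a line break by default.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PnExtractor {
    // Group 1: the value of the pn attribute; group 2: the text between > and <.
    // DOTALL (an assumption for multi-line tags) lets '.' match line breaks.
    private static final Pattern WF = Pattern.compile(
            "<wf.*?\\spn=([^\\s>]+).*?>(.*?)<", Pattern.DOTALL);

    // Hypothetical helper: collects pn value -> element text for every match.
    static Map<String, String> extract(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = WF.matcher(text);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "<wf cmd=done rdf=group pos=NNP lemma=group wnsn=1 lexsn=1:03:00::\n"
                + " pn=group>Fulton_County_Grand_Jury</wf>";
        System.out.println(extract(text));
    }
}
```

Because the result is a plain map, a later element with the same pn value overwrites an earlier one; if that matters, a map from key to a list of values may be a better fit.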
Olaf Dietsche
  • Thanks! This gives me the entire string "pn=group>Fulton_County_Grand_Jury". I will need to further extract group and Fulton_County_Grand_Jury. – serendipity Feb 02 '17 at 11:59
  • I modified your answer a tiny bit and am able to get both group and Fulton_County_Grand_Jury through m.group(1) and m.group(2). Funnily though, when I print these individually I get the correct values, but when I try to print them together I get an exception saying "no match found". Here's the code: String str = "pn=group>Fulton_County_Grand_Jury"; final Matcher m = Pattern.compile("pn=([^ >]+).*?>(.*?)<").matcher(str); while (m.find()) System.out.println(m.group(1)); System.out.println(m.group(2)); – serendipity Feb 02 '17 at 12:04
  • This may be due to missing braces around the two System.out.println calls. – Olaf Dietsche Feb 02 '17 at 12:15
  • @ Olaf Dietsche Sorry, my bad :) – serendipity Feb 02 '17 at 12:19
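
The brace issue from the comments can be sketched as follows. Without braces, only the first statement belongs to the `while` loop, so the second `group()` call runs after `find()` has already returned false, and `Matcher.group` then throws an `IllegalStateException`. The class and method names here (`BracedLoop`, `describe`) are hypothetical, and the input is assumed to include the closing `<` that the pattern requires.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracedLoop {
    // Hypothetical helper: returns one "key=value" line per match.
    static String describe(String str) {
        Matcher m = Pattern.compile("pn=([^ >]+).*?>(.*?)<").matcher(str);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            // Both group() calls are inside the braces, so each one
            // runs only after a successful find().
            sb.append(m.group(1)).append('=').append(m.group(2)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe("pn=group>Fulton_County_Grand_Jury</wf>"));
    }
}
```

An input with no `<` after the value produces no match at all with this pattern, so the loop body never runs and the result is empty.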