java - Regex to strip out greater than > and less than < characters from HTML string ignoring existing tags


I do not have a lot of experience with regular expressions, and I have an issue: I need to replace all instances of > and < with &gt; and &lt;, but leave the HTML tags intact.

For example:

String string = " <p class=\"anotherclass\"> here text value h<sub>2</sub>o > 1 , < 100 <p>";
// needs to be converted to:
// " <p class=\"anotherclass\"> here text value h<sub>2</sub>o &gt; 1 , &lt; 100 <p>"

I have tried lookahead and lookbehind expressions, but I cannot seem to get any of them to work. For example:

String string = " <p class=\"anotherclass\"> here text value h<sub>2</sub>) > 1 , < 100 <p>";
String reg1 = "<(?=[^>\\/]*<\\/)";
Pattern p1 = Pattern.compile(reg1);
String test = p1.matcher(string).replaceAll("&lt;");

This does not seem to have any effect.

I wondered if anyone else had come across this before, or if anyone can give me some guidance?

Using regex alone to "parse" HTML markup comes with hefty caveats, as many, many folks here on SO have commented. However, your request is relatively modest.

Naked < symbols between tags can be found with <(?=[^>]*(?:<|$)) and replaced with &lt;.

Naked > symbols between tags can be found with ((?:^|>)[^<]*?)> and replaced with \1&gt; (in Java's replacement syntax, the group reference is written $1&gt;).

Note that both must be applied to the whole string (not line by line). E.g. . must match \n, ^ must match the beginning of the string (not of a line), and $ must match the end of the string (not of a line).

Note that each must be performed multiple times until no matches are left, since only one replacement can be made at a time between tags.
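The two replacements and the repeat-until-stable loop could be sketched in Java roughly as follows. This is a minimal sketch, not a definitive implementation: the class name EscapeStrayBrackets and the method name escapeStray are made up for illustration, and it assumes the default Pattern flags (no MULTILINE), so ^ and $ anchor to the whole input.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class EscapeStrayBrackets {
    // A '<' that is followed by another '<' or the end of input before any '>'.
    private static final Pattern STRAY_LT = Pattern.compile("<(?=[^>]*(?:<|$))");
    // A '>' reached from the start of input or from a previous '>' without crossing a '<'.
    private static final Pattern STRAY_GT = Pattern.compile("((?:^|>)[^<]*?)>");

    static String escapeStray(String html) {
        String previous;
        String current = html;
        // Repeat both replacements until a pass changes nothing.
        do {
            previous = current;
            current = STRAY_LT.matcher(current).replaceAll("&lt;");
            // In Java replacement strings the group reference is $1, not \1.
            current = STRAY_GT.matcher(current).replaceAll("$1&gt;");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        String input = " <p class=\"anotherclass\"> here text value h<sub>2</sub>o > 1 , < 100 <p>";
        System.out.println(escapeStray(input));
    }
}
```

The loop matters mainly for the > pattern: its match consumes the preceding > (or the start of the string), so a second stray > in the same stretch of text is only picked up on a later pass.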

Caveats:

  • This only finds and replaces stray < or > symbols between tags, not inside the tags themselves. That means it will make a mess of something like <a href="/link/with/</symbol/in/it">.
  • You should, if practical, have a human check the resulting changes for validity, or at least run them through an automated checker.
  • These regexes are time-expensive, so they may not be practical if speed is an issue.
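To make the first caveat concrete, here is a small sketch that runs the same two patterns over the problematic anchor tag. The class name AttributeCaveatDemo and the method name escapeStray are hypothetical; only the two regexes come from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; shows the patterns misfiring on a '<' inside an attribute.
final class AttributeCaveatDemo {
    private static final Pattern STRAY_LT = Pattern.compile("<(?=[^>]*(?:<|$))");
    private static final Pattern STRAY_GT = Pattern.compile("((?:^|>)[^<]*?)>");

    static String escapeStray(String html) {
        String previous;
        String current = html;
        // Same repeat-until-stable loop as for ordinary text between tags.
        do {
            previous = current;
            current = STRAY_LT.matcher(current).replaceAll("&lt;");
            current = STRAY_GT.matcher(current).replaceAll("$1&gt;");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        String tag = "<a href=\"/link/with/</symbol/in/it\">";
        // The tag's own opening '<' is treated as stray, because another '<'
        // appears before the closing '>'.
        System.out.println(escapeStray(tag));
        // prints: &lt;a href="/link/with/</symbol/in/it">
    }
}
```

The opening < of the tag gets escaped while the < inside the attribute value is left alone, which is the opposite of what you would want.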

To reiterate points made by others, please consider using a markup parser instead, especially if you are doing this work on untrusted inputs.

