rule

ChaoticNeutralCzech@feddit.org · edit-2 11 hours ago

By “inside first” I mean this Regex:

rb"<(sampletag\d+)>([^<]*?)</\1>"
#  ^^^^^^^^^^^^^^^^ capture the opening tag and its name
#                  ^^^^^^^^ capture the content; [^<] makes sure there are no children
#                          ^^^^^ match the closing tag with the same name

(This is part of a Python script; the b prefix is there because I’m parsing an mmapped binary file with NUL bytes that would ruin strings.)
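For anyone who wants to try it, here is a minimal sketch of how a pattern like this can be run over an mmapped file. The sample bytes, the temporary file and the `find_leaf_tags` helper are made up for illustration; this is not my actual script, and a real drive image would be opened by path instead.

```python
import mmap
import re
import tempfile

# Raw bytes literal, so \d and \1 reach the regex engine unchanged.
LEAF_TAG = re.compile(rb"<(sampletag\d+)>([^<]*?)</\1>")

def find_leaf_tags(buf):
    """Yield (tag_name, content) for every childless element in buf."""
    for m in LEAF_TAG.finditer(buf):
        yield m.group(1), m.group(2)

# Hypothetical stand-in for the damaged image: XML fragments between junk bytes.
sample = (
    b"\x00\xff\x00<sampletag1>hello</sampletag1>\x00\x00"
    b"<sampletag2><sampletag3>inner</sampletag3></sampletag2>"
    b"<sampletag4>cut off by dama\x00\x00"
)

with tempfile.TemporaryFile() as f:
    f.write(sample)
    f.flush()
    # re accepts bytes-like objects, so the mmap can be searched directly.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        found = list(find_leaf_tags(mm))

print(found)
```

In this sample only `sampletag1` and `sampletag3` are reported: `sampletag2` has a child and `sampletag4` never closes, so neither matches.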

Yes, it only works for bottom-level XML tags. To detect parent nodes, I’d need to remove each matched level with a Regex replace and re-run it. Presumably, the middle part could be improved to also match elements with children, as long as they don’t contain a tag of the same type. Fortunately, the specific schema and the limited data I needed (strings) allowed me to just go over bottom-level elements.
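The "improved middle part" could look something like the sketch below. I never needed it, so treat it as an untested-in-anger idea: the negative lookahead, the `NESTED_TAG` name and the recursive `walk` helper are all assumptions of mine, not part of the original script.

```python
import re

# The content may contain other tags, but not another opening tag with the
# same name as the one captured in group 1. DOTALL lets "." match any byte.
NESTED_TAG = re.compile(rb"<(sampletag\d+)>((?:(?!<\1>).)*?)</\1>", re.DOTALL)

def walk(buf, depth=0):
    """Yield (depth, tag_name, content) for each element, parents first."""
    for m in NESTED_TAG.finditer(buf):
        name, content = m.group(1), m.group(2)
        yield depth, name, content
        # Children sit inside the parent's content, so search it again.
        yield from walk(content, depth + 1)

# Hypothetical sample: one parent with text and a single child.
sample = b"\x00<sampletag2>a<sampletag3>inner</sampletag3>b</sampletag2>\x00"
result = list(walk(sample))
print(result)
```

One caveat: on damaged data, an opening tag whose closing tag is missing makes the pattern scan ahead until it hits the same opening tag again or the end of the buffer, so this may be much slower than the `[^<]` version on a large image.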

I’d use an XML library, but it’s not a valid XML file: it’s part of a raw image of a damaged drive that had XML files on it. Very cursed. It worked in a pinch, but you shouldn’t ever parse XML/HTML with Regex if you can avoid it by using a library like BeautifulSoup. By the way, some people have used Regex to parse HTML; see the Chad Scraper meme.