Extracting Subtitles from Polish Television

Started by naszdom, October 27, 2012, 14:40:19


naszdom

TSDoctor Beta

I've been trying to extract Polish subtitles as an srt file.  I managed one, and whilst it was complete, every timing was 03:20:06,662 --> 03:20:06,662.  The log file states:

First teletext header at 05:08:40.884 513
Last teletext header at 05:08:40.890 201

which possibly explains why I get a constant time setting. 

Is there a way round this, and if so, how do I correct the gobbledegook characters in the subtitles?
VU+ uno, Technomate 5400 CI+,

Cypheros

I guess the PTS is constant or missing over a wide area. This timestamp is needed for every single substream (audio/video/text) to determine the presentation time.

It would help a lot to have a sample (10-20 MB) or a log from TS-Doctor about the file.

naszdom

Thanks for replying Cypheros

I think the PTS is missing but I do not know whether that is the fault of the broadcaster or the STB.

A sample of the subtitles as well as the log was attached to my original post.

Cypheros

The STB is not responsible for this, because the PTS is embedded deep in the PES structure.
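
To illustrate where the PTS lives, here is a minimal Python sketch that reads it from the start of a PES packet. It assumes the standard MPEG-2 PES layout with the optional header present (as used for audio, video and private_stream_1 data such as teletext); the function name decode_pts is made up for this example and has nothing to do with how TS-Doctor works internally.

```python
from typing import Optional

PTS_CLOCK_HZ = 90_000  # PTS values count ticks of a 90 kHz clock


def decode_pts(pes: bytes) -> Optional[int]:
    """Return the 33-bit PTS of a PES packet, or None if it carries no PTS.

    Assumes `pes` starts at the packet start code and that the stream type
    uses the optional PES header (hypothetical helper, for illustration only).
    """
    if len(pes) < 9 or pes[0:3] != b"\x00\x00\x01":
        raise ValueError("not the start of a PES packet")
    # Byte 7 holds PTS_DTS_flags in its two top bits; 0x80 means "PTS present".
    if not pes[7] & 0x80:
        return None
    if len(pes) < 14:
        raise ValueError("PES header truncated before the PTS field")
    b = pes[9:14]
    # The 33 bits are spread over 5 bytes, separated by marker bits.
    return (
        (((b[0] >> 1) & 0x07) << 30)
        | (b[1] << 22)
        | ((b[2] >> 1) << 15)
        | (b[3] << 7)
        | (b[4] >> 1)
    )


def pts_to_seconds(pts: int) -> float:
    """Convert a PTS tick count to seconds."""
    return pts / PTS_CLOCK_HZ
```

If every teletext packet in a recording decodes to the same value, or to None, any subtitle timing derived from it will come out constant as well.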

Try TS-Doctor to fix the recording and get your subtitles. You can use the free 30-day trial version to check whether it works for you. To get the SRT subtitles, you just have to activate the export of teletext subtitles under settings/teletext.

naszdom

Thanks again Cypheros. 

I'll try the 30 day trial version tomorrow and let you know how I get on.

naszdom

Sorry Cypheros

Still the same.  The subtitles obtained from a fixed recording are just the same.  They all have timings 00:00:00,000 --> 00:00:00,000

Here is the fixed_fixed log. As you can see, there is still virtually no difference between the First and Last teletext header.

I also attach the fixed_fixed_problem.txt.  Not that it means anything to me.

Cypheros

In the log I can see that there are 4 subtitle streams:

  Subtitle page: 778 [und]
  Subtitle page: 777 [und]
  Subtitle page: 779 [pol]
  Subtitle page: 776 [und]


I guess the problem is the same for all subtitles?

The strange thing is that the teletext stream does seem to have a PTS. I have never seen this behaviour before. Could you send me a short sample (~20 MB) from a part of the recording that has subtitles?
You could use the Raw-Cutter from the Tools menu.

Email: support{at}cypheros.de


patak

The problem mentioned here is fixed in current releases. However, Polish HBO is still causing trouble.
Every 5th to 10th teletext subtitle is cut in two: the same text is shown as two consecutive subtitles. It would be awesome if TS-Doctor could merge them automatically. I haven't found a subtitle editor that can do it either. I managed to prepare a 20 MB sample if you're interested.

Example:
4
00:00:16,637 --> 00:00:20,805
ale szpital
zapewni jej wszelkie wygody.

5
00:00:20,840 --> 00:00:21,015
ale szpital
zapewni jej wszelkie wygody.
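
A merge like this can be sketched in a few lines of Python. This is only an illustration, not what TS-Doctor or CleanSRT actually does: it assumes well-formed SRT input, treats two cues as one when their text is identical and the gap between them is at most max_gap_ms (an arbitrary threshold chosen for this example), and all function names are made up.

```python
import re
from dataclasses import dataclass

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


@dataclass
class Cue:
    start_ms: int
    end_ms: int
    text: str


def to_ms(stamp: str) -> int:
    """Convert 'HH:MM:SS,mmm' to milliseconds."""
    h, m, s, ms = map(int, TIME_RE.fullmatch(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def to_stamp(ms: int) -> str:
    """Convert milliseconds back to 'HH:MM:SS,mmm'."""
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def parse_srt(data: str) -> list:
    """Parse SRT text into cues; blocks without a timing line or text are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", data.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->")
        cues.append(Cue(to_ms(start), to_ms(end), "\n".join(lines[2:])))
    return cues


def merge_repeats(cues: list, max_gap_ms: int = 100) -> list:
    """Fold a cue into the previous one if the text repeats almost immediately."""
    merged = []
    for cue in cues:
        prev = merged[-1] if merged else None
        if (prev is not None and prev.text == cue.text
                and cue.start_ms - prev.end_ms <= max_gap_ms):
            prev.end_ms = max(prev.end_ms, cue.end_ms)  # extend, keep the start
        else:
            merged.append(Cue(cue.start_ms, cue.end_ms, cue.text))
    return merged


def format_srt(cues: list) -> str:
    """Write cues back as SRT text, renumbering from 1."""
    return "\n".join(
        f"{i}\n{to_stamp(c.start_ms)} --> {to_stamp(c.end_ms)}\n{c.text}\n"
        for i, c in enumerate(cues, 1)
    )
```

Applied to the two cues above, this leaves a single cue running from 00:00:16,637 to 00:00:21,015. Start times are never moved, so synchronisation with the video is unaffected.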

Cypheros

Yes, some samples would be great. I'll send you an FTP server account for uploading sample files, if you want.

I have no access to Polish TV at the moment.

Djfe

One of our forum members (Mam) wrote a simple program for such a job some time ago.
-> Some broadcasters were showing words just as they were spoken. That caused problems for him later on, when he tried to find external players that could show subtitles changing that fast (they were ignoring parts of the subtitles), and the subtitles were a bit out of sync as well.

The attachment contains the .net Framework 4 source code ;)
http://forum.cypheros.de/index.php?topic=2215.msg12719#msg12719

this might or might not work for you

A good editor for subtitles in general is Subtitle Edit.
It's open source and supports pretty much every subtitle format you need.
http://www.nikse.dk/subtitleedit/

patak

I'm heading to DreamSpark to pick up a copy of VS to check CleanSRT :)
Subtitle Edit is nothing new to me ;) It finds duplicate lines, but it can't merge them automatically.
It hadn't even occurred to me that my C#/Java skills are probably good enough to write an app on my own :P

EDIT: CleanSRT cleans up the files correctly :) However, it seems HBO omits some lines in the TXT subs compared with the DVB subs.

Cypheros

Thanks for the sample. The next beta version will hopefully fix the problem by letting the text stay on screen longer.

1
00:00:01,780 --> 00:00:05,994
Przekaż jej, że jest bardzo chora.

2
00:00:09,682 --> 00:00:12,157
Nie wyzdrowieje.

3
00:00:14,435 --> 00:00:16,602
Jej stan będzie się pogarszał,

4
00:00:16,637 --> 00:00:21,015
ale szpital
zapewni jej wszelkie wygody.


Djfe

@cypheros Does TSD merge these subtitles now? Or does it change the timing, which could cause other synchronisation issues?

Cypheros

No, the relation between the timers is not changed during the merge. If a timer change is needed, all streams are shifted by the same value.


www.cypheros.de